Open fdschneider opened 6 years ago
Not super sure about including more datasets in the package itself (I don't know if there is an "ideal" size for a package). If we do, they should be small, I guess. Alternatively, or in addition, we could provide a tutorial with more examples of how to handle different trait datasets using the package (not only the CC0 ones).
Ha! Trick is, we're not including the datasets, just providing code to pull the datasets from their source.
See the files in data.R. Only when you call data(carabids) is the file downloaded and made available for use. The package remains small. The user decides what to download.
The package vignette contains plenty of advice on how to harmonize own data, or data from other sources.
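A minimal base R sketch of the download-on-demand idea described above. R's data() sources any .R script it finds in a package's data/ directory, so such a script can fetch the file only when the dataset is requested. The helper name pull_dataset and the toy columns are hypothetical, and a temporary local CSV stands in for the remote source so the sketch runs offline; this is not the package's actual code.

```r
# Hypothetical helper: read a delimited file from a URL or local path
# only at the moment the dataset is requested.
pull_dataset <- function(source, sep = ",") {
  utils::read.csv(source, sep = sep, stringsAsFactors = FALSE)
}

# Stand-in for a remote source (e.g. a Figshare download link):
# a small CSV with made-up values, written to a temporary file.
tmp <- tempfile(fileext = ".csv")
writeLines(c("species,body_length_mm",
             "sp_a,10",
             "sp_b,20"), tmp)

# In a package, a line like this could live in a script under data/
# and would be evaluated when the user calls data(<name>).
carabids_demo <- pull_dataset(tmp)
str(carabids_demo)
```

Nothing is downloaded or stored until the script is evaluated, which is why the installed package stays small.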
Then it's all good!! Sorry, I need to dive a bit more into the package!
@fdschneider I started to add more datasets in https://github.com/caterinap/traitdataform/tree/master/data. See if it's fine, I can continue adding more later in the week. Also added more entries in the spreadsheet and a new column indicating if the dataset is in the package.
Hi @fdschneider, your initiative seems really cool! I hope to use it soon ;)
A lot of work has been done by the people who built the EcoData Retriever (http://www.data-retriever.org/, GitHub repo); you can see the available datasets here.
I'm also thinking about the traits package by rOpenSci. Maybe you could use some wrappers around those already-built tools?
@Rekyt Thanks. Yes, I looked into those. We basically use the same idea as Retriever when pulling example datasets from the original sources on Figshare or wherever. The 'traits' package is great for tapping APIs of more extensive databases. There is also the package 'TR8'.
It would be cool to have wrappers for these data sources that add harmonization on top.
Ok, now all CC0 datasets are in the package, in the same form as the "carabids" one. On Windows I did not get errors when building the package (only warnings).
Some remarks:
Have a look and let me know if you want to add/remove/change anything!
Great, thanks.
I will pull and test it.
I wasn't aware that some of those datasets have so many traits. Great job mapping them to the ontologies.
However, I just noticed that the URIs in Nadja's list are not correct. They should correspond to the URL with headings, e.g. https://ecologicaltraitdata.github.io/TraitDataList/#age_at_reproduction.
We should fix this in the TraitDataList repository, @nadjasimons.
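A small sketch of how such URIs could be built from a heading anchor, in base R. The base URL is taken from the example link above; the helper name trait_uri is hypothetical, and it assumes the anchor equals the trait name in lower case with underscores, as in that example.

```r
# Base URL as it appears in the example link in this thread.
base_url <- "https://ecologicaltraitdata.github.io/TraitDataList/"

# Hypothetical helper: append the heading anchor to the base URL.
trait_uri <- function(trait) paste0(base_url, "#", trait)

trait_uri("age_at_reproduction")
```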
Furthermore, I thought that some of the cryptic trait names might be replaced by more intuitive trait names.
E.g. if the thesaurus call states
X10.2_SocialGrpSize = traitdataform::as.trait("social_group_size", expectedUnit = NA,
valueType = "numeric"),
the function standardize() will keep the original name in traitName but replace it with the easier one in traitNameStd.
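To illustrate the renaming behaviour described here, a base R sketch with a plain lookup vector. It only mimics the outcome (original name kept in traitName, readable name added as traitNameStd); it does not call traitdataform, and the toy data frame, its values, and the lookup object are made up for illustration.

```r
# Toy long-format data with a cryptic original trait name (values are made up).
traits_long <- data.frame(
  taxon = c("sp_a", "sp_b"),
  traitName = c("X10.2_SocialGrpSize", "X10.2_SocialGrpSize"),
  traitValue = c(3, 12),
  stringsAsFactors = FALSE
)

# Hypothetical lookup: original column name -> intuitive trait name.
lookup <- c(X10.2_SocialGrpSize = "social_group_size")

# Keep traitName untouched; add the readable name as traitNameStd.
traits_long$traitNameStd <- unname(lookup[traits_long$traitName])
traits_long
```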
The CC BY 4.0 data could be added in the future in just the same way, since we always state the correct reference.
I think the Ricklefs data on passerine birds can't be included since it is not labelled as public domain or CC BY. Sorry, that license statement in the documentation is my fault, I guess. I already removed it from the current version.
ok, so I will:
Concerning the passerines, I actually checked before adding the dataset, and in the metadata (which is a Word file in the supplementary material) he states:
- Copyright restrictions: None
- Proprietary restrictions: None
- Costs: None
So I guess that we could keep it.
Ok, thanks. No pressure. Whenever you find time.
The passerines: I'm relieved. After a colleague assured me that the data are open, I was desperately looking for this disclaimer but didn't find it. Great 'bad example' for open data labelling.
I fixed URIs in the trait data list
For now this is on hold because it overlaps with functionality provided by Will Pearse's natdb package (https://github.com/willpearse/natdb, @willpearse). They include 100+ datasets with short recipes (see this file: https://github.com/willpearse/natdb/blob/master/R/downloads.R), and in the process fix some major heterogeneity in the data (like replacing abbreviations with species names or adding units). I did not have the time to investigate how the data are processed into a virtual database. We should figure out how the two packages can complement each other.
Regardless, I would like to include Caterina's pull request for v1.0 to have some more example datasets to draw from.
Sorry to have been a bit slow to reply to this mention.
We have a plan, right now, to get a citable bioRxiv paper for MADworld (which is going to combine NACDB and NATDB) up around late January or early February. We are definitely interested in interoperability, and I would love to make a wrapper linking your data structure into NATDB format. As I've mentioned before, but don't mind saying again, I think what you've done here is fantastic!!!
Thanks Will, and sorry for not keeping up with our earlier e-mail discussion. I wanted to get a first functional version out before investigating interfaces with other tools any further. Let me know how I can help make this work seamlessly with your package.
No worries; that's just life! :D
Makes sense to get something out that's functional first. When you have that ready, ping me and I will (1) take a look and then (2) figure out a path forward.
The package should provide more datasets from the living spreadsheet (https://github.com/fdschneider/bexis_traits/issues/20).
A standardised version of each dataset should be provided as well (linking to trait Thesauri and taxon Ontologies).