dandi / helpdesk

Repository to track help tickets from users.

Subject species validation #81

Closed haesemeyer closed 1 year ago

haesemeyer commented 1 year ago

Bug description

"Danio rerio" is not in the species database and in spite of suggestion when running dandi upload the NCBI taxonomy URL format is not accepted for subject. species anymore.

Expected behaviour

Fix the help text shown by dandi upload validation so that it states that the only accepted species definition is now the Latin name. My personal preference would be to still allow taxonomy URLs, but if this has been removed for a good reason, the help text should be updated instead.

Actual behaviour

Running dandi upload with the species Danio rerio suggests either having this species added (surprised it isn't ;-) ) or supplying a taxonomy URL instead. But while I had this working in the past, that does not appear to be a valid option anymore.

How to reproduce

Call dandi upload on an NWB file with a species not known to dandi.

satra commented 1 year ago

pinging @bendichter and @CodyCBakerPhD as this may be an issue inside nwb inspector. the dandi schema continues to support NCBITaxon urls (but they have to be chosen from the NCBI taxonomy).

i'm assuming this is what you added: http://purl.obolibrary.org/obo/NCBITaxon_7955

haesemeyer commented 1 year ago

Yes that is exactly the URL that I used (and this worked in the past).

CodyCBakerPhD commented 1 year ago

@satra @haesemeyer indeed the Inspector currently only supports the Latin binomial, as that is more immediately human-readable, whereas someone would have to follow the ontology link to discover the name. Happy to extend this Inspector check to include the taxonomy link to NCBI, but I think there's some overlap here with the external ontology linkage that the PyNWB team was working on a while back (unsure what the status of that project is).
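For readers following along, the distinction between the two species forms discussed here can be sketched roughly as below. This is only an illustration, not the NWB Inspector's actual check: the function name `species_form` and both regular expressions are assumptions made for this example, and the binomial pattern is deliberately simplistic (it ignores subspecies and other variants).

```python
import re

# Assumed patterns for illustration only; not taken from the NWB Inspector.
LATIN_BINOMIAL = re.compile(r"[A-Z][a-z]+ [a-z]+")
NCBI_TAXON_URL = re.compile(r"http://purl\.obolibrary\.org/obo/NCBITaxon_\d+")


def species_form(species: str) -> str:
    """Classify a species string (hypothetical helper).

    Returns "ncbi_url" for an NCBITaxon PURL, "latin_binomial" for a
    capitalized two-word Latin name, and "invalid" for anything else.
    """
    if NCBI_TAXON_URL.fullmatch(species):
        return "ncbi_url"
    if LATIN_BINOMIAL.fullmatch(species):
        return "latin_binomial"
    return "invalid"
```

With these assumed patterns, both "Danio rerio" and http://purl.obolibrary.org/obo/NCBITaxon_7955 are recognized, while a lowercase "danio rerio" is not.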

haesemeyer commented 1 year ago

@CodyCBakerPhD I see, that does make sense. What I like about the link is that it is the most comprehensive ID of a species, and in fact DANDI already seems to convert that link into the species name for the dandiset description (that's what happened with data I previously uploaded).

@satra In the meantime, I think the message shown by dandi upload should be changed, since it currently suggests doing something that is not supported as part of the upload process. Also, the error that the taxonomy URL currently generates is not really identified in the log file. The process just fails, printing "failed validation" to the command line, but without a clear message in the log as to what actually failed. The only reason I found out that the URL was the problem was that I changed the species to "Mus musculus" and then the upload succeeded.

I wonder if the species support on DANDI could be expanded to the full set on the NCBI taxonomy database by scraping that set.

Cheers, Martin

satra commented 1 year ago

@CodyCBakerPhD - i think the nwb inspector should either degrade this to a warning or accommodate the dandi specific requirements. indeed dandi converts the url to a human readable component that one can see in the assets summary and in the asset metadata.

as @haesemeyer said, we much prefer the more authoritative and precise link, and since the inspector provides a dandi-biased config, i would suggest allowing this check. in fact we would raise an error if the species term used a latin name that wasn't in the dandi dictionary at present. and since every species, even amongst the ones we know, has several latin variants, we opted for the url as preference. once external ontologies are implemented in nwb with full support, we would be happy to support that, but till then we prefer either a known entity that we can do a reverse lookup on, or the precise url.

@haesemeyer - we were trying to be both restrictive to species and allowing for additional species through the NCBI url rather than allow text. i have just sent a PR to add zebrafish: https://github.com/dandi/dandi-cli/pull/1129

@yarikoptic - is there something in the CLI that can limit this inspector validation step so that @haesemeyer can move ahead? also merge PR and release if not.

haesemeyer commented 1 year ago

@satra Thanks for working on this. I guess the change isn't live yet (see below). I'm not in a rush on this but plan to upload a chunk of data in the coming weeks. I fully agree with you that allowing the URL again would make the most sense, but otherwise I'll switch to "danio rerio" as an identifier.

'status': 'skipped', 'message': 'failed to extract metadata: Cannot interpret species field: danio rerio. Please contact help@dandiarchive.org to add your species. You can also put the entire url from NCBITaxon (http://www.ontobee.org/ontology/NCBITaxon) into your species field in your NWB file. For example: http://purl.obolibrary.org/obo/NCBITaxon_9606 is the url for the species Homo sapiens. Please note that this url is case sensitive.'
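The behaviour described in this message (accept a known name via reverse lookup, otherwise require the exact NCBITaxon URL) could look roughly like the sketch below. It is not the dandi-cli implementation: `KNOWN_SPECIES`, `resolve_species`, and the shortened error text are hypothetical, and the dictionary contains only the two species whose IDs appear in this thread.

```python
NCBI_PREFIX = "http://purl.obolibrary.org/obo/NCBITaxon_"

# Hypothetical stand-in for the dictionary of species known to DANDI.
KNOWN_SPECIES = {
    "homo sapiens": NCBI_PREFIX + "9606",
    "danio rerio": NCBI_PREFIX + "7955",
}


def resolve_species(value: str) -> str:
    """Map a species field to an NCBITaxon URL (illustrative only)."""
    value = value.strip()
    taxon_id = value[len(NCBI_PREFIX):]
    # A full NCBITaxon URL is passed through; the prefix check is case sensitive.
    if value.startswith(NCBI_PREFIX) and taxon_id.isascii() and taxon_id.isdigit():
        return value
    # Otherwise try a case-insensitive reverse lookup on the known names.
    url = KNOWN_SPECIES.get(value.lower())
    if url is None:
        raise ValueError(f"Cannot interpret species field: {value}")
    return url
```

In this sketch, a species missing from the dictionary raises an error naming the offending value, which is the kind of explicit message requested earlier in the thread.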

CodyCBakerPhD commented 1 year ago

Started work on https://github.com/NeurodataWithoutBorders/nwbinspector/pull/290 to allow this at the Inspector level. Probably a lot easier/faster than trying to adjust the dandi-cli as per

@yarikoptic - is there something in the CLI that can limit this inspector validation step so that @haesemeyer can move ahead? also merge PR and release if not.

haesemeyer commented 1 year ago

Thanks, this all seems fixed now. I was just able to upload with the NCBI taxonomy URL.

Cheers, Martin

yarikoptic commented 1 year ago

@yarikoptic - is there something in the CLI that can limit this inspector validation step so that @haesemeyer can move ahead? also merge PR and release if not.

For future reference, this is discouraged but possible (see the last two options):

$> dandi upload --help | grep -e --validation
  --validation [require|skip|ignore]