Open yarikoptic opened 1 year ago
let's check this with the dandischema aggregate function run locally on asset metadata to ensure this is not a dandischema issue. the aggregate summary function works on a list of asset metadata.
then we can circle back here to see if this is some postgres query issue that should not be including some assets. also we should ask submitters to make sure there is a match between local and remote files for complete dandisets, or that they confirm the right number of files.
seems like a schema issue, while loading assets.jsonld from the s3 bucket.
In [1]: from dandischema.metadata import aggregate_assets_summary
In [2]: import json
In [3]: with open("assets.jsonld") as fp:
...: data = json.load(fp)
...:
In [4]: len(data)
Out[4]: 1013
In [5]: aggregate_assets_summary(data)
Out[5]:
{'schemaKey': 'AssetsSummary',
'numberOfBytes': 9004401256,
'numberOfFiles': 1013,
'numberOfSubjects': 1097,
'dataStandard': [{'schemaKey': 'StandardsType',
'identifier': 'RRID:SCR_015242',
'name': 'Neurodata Without Borders (NWB)'}],
'approach': [{'schemaKey': 'ApproachType', 'name': 'behavioral approach'}],
'measurementTechnique': [{'schemaKey': 'MeasurementTechniqueType',
'name': 'analytical technique'},
{'schemaKey': 'MeasurementTechniqueType', 'name': 'behavioral technique'}],
'variableMeasured': ['Position', 'ProcessingModule', 'SpatialSeries'],
'species': [{'schemaKey': 'SpeciesType',
'identifier': 'http://purl.obolibrary.org/obo/NCBITaxon_7227',
'name': 'Drosophila melanogaster - Fruit fly'},
{'schemaKey': 'SpeciesType',
'identifier': 'http://purl.obolibrary.org/obo/NCBITaxon_28584',
'name': 'Drosophila suzukii'}]}
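The gap between 1013 files and 1097 subjects can be reproduced on a toy asset record where the metadata subject id and the id embedded in the path differ only by sanitized characters. This is a minimal sketch; the `wasAttributedTo`/`identifier` field names mirror the DANDI asset schema and the aggregation logic is a simplified assumption, not the actual dandischema code:

```python
# Toy asset record: the metadata subject id and the id embedded in the
# path disagree only in sanitized characters ('%' and '#' stripped).
assets = [
    {"path": "sub-0p8-1p4-CS-fly-16/probe.nwb",
     "wasAttributedTo": [{"schemaKey": "Participant",
                          "identifier": "0p8%-1p4%-CS-fly#-16"}]},
]

subjects = set()
for asset in assets:
    # subject ids recorded in the asset metadata
    for attr in asset.get("wasAttributedTo", []):
        if attr.get("schemaKey") == "Participant" and attr.get("identifier"):
            subjects.add(attr["identifier"])
    # subject ids implied by the asset path
    for part in asset["path"].split("/"):
        if part.startswith("sub-"):
            subjects.add(part[len("sub-"):])

# one file, but two "subjects", because the two spellings never collide
print(len(assets), len(subjects))  # -> 1 2
```

If the real aggregation pools subject ids from both sources without reconciling them, every asset whose path id was sanitized differently from its metadata id contributes an extra subject.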
@yarikoptic - it's very likely happening because of normalization of subject identifier. perhaps something has changed between the way we were normalizing before and now. dandischema assumes subject id in path can be generated simply by replacing "_" with "-". i suspect cli is doing more than that and stripping other characters.
that normalization has to be moved to dandi schema and used in the asset summary generation. moving this issue to dandi-schema.
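The mismatch between the two normalizations can be illustrated as follows. `schema_guess` is the simple `"_" -> "-"` replacement dandischema assumes; `cli_sanitize` is a hypothetical stand-in for the heavier organize-time sanitization (stripping characters like `%` and `#` is an assumption here, not the actual dandi-cli code):

```python
import re

def schema_guess(identifier: str) -> str:
    # what dandischema assumes: path id == metadata id with '_' -> '-'
    return identifier.replace("_", "-")

def cli_sanitize(identifier: str) -> str:
    # hypothetical stand-in for dandi-cli's organize-time sanitization,
    # which apparently also strips characters like '%' and '#'
    return re.sub(r"[^A-Za-z0-9-]", "", identifier.replace("_", "-"))

ident = "0p8%-1p4%-CS_fly#-16"
print(schema_guess(ident))   # '0p8%-1p4%-CS-fly#-16'
print(cli_sanitize(ident))   # '0p8-1p4-CS-fly-16'
# the two spellings differ, so the same subject is counted twice
```

Whenever the two functions disagree on an identifier, the summary sees two distinct subjects for one real one.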
sorry -- I am not following how normalization has anything to do with it, since we have a higher number in the summary than there really are sub- folders on the drive (1097 > 1013). So we need to look at where it comes up with "extra subjects" -- likely counting the same files twice...
I see now what you mean about normalization e.g. '0p8--1p4--CS-fly#-16', '0p8%-1p4%-CS-fly#-19',
...
@yarikoptic - can you point me to the function in dandi-cli where the filename normalization occurs? as a start i will first simply copy the function to dandi-schema, or you could do it and replace the place where it is used.
this is where the mismatches are:
1. from the asset metadata record: https://github.com/dandi/dandi-schema/blob/b310e3e1f6745b7fb5df47daeaabadbc72a7f78d/dandischema/metadata.py#L302
2. the other place we are getting subjects from is the asset path: https://github.com/dandi/dandi-schema/blob/b310e3e1f6745b7fb5df47daeaabadbc72a7f78d/dandischema/metadata.py#L314
the reason we do the last bit is that the bids-asset stream does not populate asset metadata properly with participant and other goodies from the bids structure.
do you mean where those % could come from? isn't all the code involved in the aggregate_assets_summary used above in dandi-schema, so the gotcha must be somewhere there?
but as for what creates filenames in dandi-cli -- AFAIK only organize, and the code is at https://github.com/dandi/dandi-cli/blob/HEAD/dandi/organize.py#L68:5
> 2. the other place we are getting subjects from is the asset path: https://github.com/dandi/dandi-schema/blob/b310e3e1f6745b7fb5df47daeaabadbc72a7f78d/dandischema/metadata.py#L314
> the reason we do the last bit is that the bids-asset stream does not populate asset metadata properly with participant and other goodies from the bids structure.
that is where I thought (didn't yet) to check/propose a fix -- to not go through all parts of the file path, but only look at the top-level folder names, since with the above, an inconsistent dandiset with sub-1/sub-2.dat would result in 2 subjects.
it's this function that is taking subject from metadata and sanitizing for a string in a filename: https://github.com/dandi/dandi-cli/blob/eef8443a16f0968891f4ddfca43d663df3f07f2b/dandi/organize.py#L392
i'll copy that over. that should at least temporarily fix the issue.
regarding reading the subject from a file path: this entire function assumes the dandiset is valid, as it's supposed to provide a summary of such a state. it can do certain things even with invalid dandisets, but i wouldn't consider the summary valid if the dandiset is invalid. that part was created when we did not have any bids validation verification in place nor metadata extraction. so indeed the cli should extract and populate relevant metadata from a bids asset path or other relevant files. but there are a few things this function is doing that should be done by the asset metadata extractor.
> regarding the reading subject from a file path, this entire function assumes the dandiset is valid as its supposed to provide a summary of such a state. it can do certain things even with invalid dandisets, but i wouldn't consider the summary valid if the dandiset is invalid.
well, we generate summaries for invalid ones as well. The question is how robust we should be in such cases. And IMHO relying just on sub- at the top level sounds like the easiest and most robust approach.
the only con (or maybe pro) -- we would not pick up subjects from within folders like derivatives/ etc, so some datasets of that kind would not be considered. So the "robustification" could be -- take the first subject indicator found in the path, not all of them ;)
I don't think we have either fixed this issue or boiled it down.
I improved the script at https://github.com/dandi/dandisets/blob/draft/tools/check_numberOfSubjects so it also states the most recently published version, which might not correspond to the draft, but is still a good hint that we assumed it was all kosher. And we have
dandi@drogon:/mnt/backup/dandi/dandisets$ tools/check_numberOfSubjects | grep -v 'version: null'
/mnt/backup/dandi/dandisets/000293: Mismatch 201 != 121 MRP version: "0.220708.1652"
/mnt/backup/dandi/dandisets/000454: Mismatch 5 != 4 MRP version: "0.230302.2331"
/mnt/backup/dandi/dandisets/000575: Mismatch 12 != 13 MRP version: "0.231010.1811"
/mnt/backup/dandi/dandisets/000692: Mismatch 7 != 9 MRP version: "0.240402.2118"
/mnt/backup/dandi/dandisets/000714: Mismatch 8 != 9 MRP version: "0.240402.2115"
/mnt/backup/dandi/dandisets/000939: Mismatch 28 != 29 MRP version: "0.240327.2229"
and for https://dandiarchive.org/dandiset/000293/draft we do not have any warnings etc. So I think we simply do not validate consistency of "dandi-layout" and nwb metadata.
but overall it is indeed a "wild west" out there -- people often simply do not care to make their dandisets "valid". I think we would benefit from some formalized way to provide feedback to dandiset owners for all those dandisets which remain invalid while their authors do not really concern themselves.
so that relates to the discussion above on sanitization -- but it now leads to a double count if the metadata subject id differs from the one in the filename, and that is a bug.
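The kind of consistency check the tools/check_numberOfSubjects script performs can be sketched in a few lines (a simplification of the real script, which reads the recorded summary and the on-disk layout):

```python
def mismatch(number_of_subjects: int, subject_dirs: list[str]) -> bool:
    """Compare the summary's numberOfSubjects against on-disk sub-* folders."""
    on_disk = {d for d in subject_dirs if d.startswith("sub-")}
    return number_of_subjects != len(on_disk)

# e.g. for 000293 the summary claims 201 subjects but only 121 sub-* folders
dirs_000293 = [f"sub-{i}" for i in range(121)]
print(mismatch(201, dirs_000293))  # -> True (a mismatch to report)
```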
It was mentioned by the dandiset author for https://dandiarchive.org/dandiset/000212 that there are only 1013 files, yet the summary has 1097 subjects!
I did check across all dandisets for which we have datalad representations:
so it seems that the issue manifests itself quite widely.