dandi / dandi-schema

Schemata for DANDI archive project
Apache License 2.0

Incorrect summaries (at least number of subjects) in some dandisets #172

Open yarikoptic opened 1 year ago

yarikoptic commented 1 year ago

It was mentioned by the dandiset author for https://dandiarchive.org/dandiset/000212 that there are only 1013 files, yet the summary reports 1097 subjects!

I did check across all dandisets for which we have datalad representations:

(dandisets) dandi@drogon:/mnt/backup/dandi/dandisets$ for d in 0*; do ( cd $d; n1=$(awk '/numberOfSubjects:/{print $2}' dandiset.yaml || echo 0 ); n2=$(/bin/ls -1d sub-* 2>/dev/null | wc -l || echo 0); [ -n "$n1" ] && { [ "$n1" = "$n2" ] || echo "`pwd`: Mismatch $n1 != $n2"; } ); done
/mnt/backup/dandi/dandisets/000025: Mismatch 1 != 0
/mnt/backup/dandi/dandisets/000043: Mismatch 22 != 13
/mnt/backup/dandi/dandisets/000051: Mismatch 1 != 0
/mnt/backup/dandi/dandisets/000052: Mismatch 1 != 0
/mnt/backup/dandi/dandisets/000058: Mismatch 1 != 0
/mnt/backup/dandi/dandisets/000168: Mismatch 30 != 0
/mnt/backup/dandi/dandisets/000212: Mismatch 1097 != 1013
/mnt/backup/dandi/dandisets/000228: Mismatch 64 != 32
/mnt/backup/dandi/dandisets/000246: Mismatch 12 != 11
/mnt/backup/dandi/dandisets/000288: Mismatch 12 != 9
/mnt/backup/dandi/dandisets/000293: Mismatch 201 != 121
/mnt/backup/dandi/dandisets/000297: Mismatch 197 != 118
/mnt/backup/dandi/dandisets/000341: Mismatch 310 != 0
/mnt/backup/dandi/dandisets/000362: Mismatch 11 != 0
/mnt/backup/dandi/dandisets/000454: Mismatch 5 != 4

so it seems that the issue manifests itself quite widely.

satra commented 1 year ago

let's check this with the dandischema aggregate function run locally on asset metadata to ensure this is not a dandischema issue. the aggregate summary function works on a list of asset metadata.

then we can circle back here to see if this is some postgres query issue that should not be including some assets. also we should ask submitters to make sure there is a match between local and remote files for complete dandisets, or that they confirm the right number of files.

satra commented 1 year ago

seems like a dandischema issue, reproduced by loading assets.jsonld from the s3 bucket:

In [1]: from dandischema.metadata import aggregate_assets_summary

In [2]: import json

In [3]: with open("assets.jsonld") as fp:
   ...:     data = json.load(fp)
   ...: 

In [4]: len(data)
Out[4]: 1013

In [5]: aggregate_assets_summary(data)
Out[5]: 
{'schemaKey': 'AssetsSummary',
 'numberOfBytes': 9004401256,
 'numberOfFiles': 1013,
 'numberOfSubjects': 1097,
 'dataStandard': [{'schemaKey': 'StandardsType',
   'identifier': 'RRID:SCR_015242',
   'name': 'Neurodata Without Borders (NWB)'}],
 'approach': [{'schemaKey': 'ApproachType', 'name': 'behavioral approach'}],
 'measurementTechnique': [{'schemaKey': 'MeasurementTechniqueType',
   'name': 'analytical technique'},
  {'schemaKey': 'MeasurementTechniqueType', 'name': 'behavioral technique'}],
 'variableMeasured': ['Position', 'ProcessingModule', 'SpatialSeries'],
 'species': [{'schemaKey': 'SpeciesType',
   'identifier': 'http://purl.obolibrary.org/obo/NCBITaxon_7227',
   'name': 'Drosophila melanogaster - Fruit fly'},
  {'schemaKey': 'SpeciesType',
   'identifier': 'http://purl.obolibrary.org/obo/NCBITaxon_28584',
   'name': 'Drosophila suzukii'}]}
satra commented 1 year ago

@yarikoptic - it's very likely happening because of normalization of subject identifier. perhaps something has changed between the way we were normalizing before and now. dandischema assumes subject id in path can be generated simply by replacing "_" with "-". i suspect cli is doing more than that and stripping other characters.

that normalization has to be moved to dandi schema and used in the asset summary generation. moving this issue to dandi-schema.

yarikoptic commented 1 year ago

sorry -- I am not following how normalization has anything to do with it since we have higher number in the summary than there is really of sub- folders on the drive (1097 > 1013). So needs to look on where it comes up with "extra subjects", likely counting the same files twice...

yarikoptic commented 1 year ago

I see now what you mean about normalization e.g. '0p8--1p4--CS-fly#-16', '0p8%-1p4%-CS-fly#-19',...
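To make the failure mode concrete, here is a minimal sketch. `cli_like_sanitize` is a hypothetical stand-in, inferred only from the two example ids above, for whatever extra character replacement dandi-cli's organize step performs beyond the plain `"_"` to `"-"` substitution that dandischema assumes:

```python
import re

def naive_norm(subject_id: str) -> str:
    # what the dandischema summary code assumes the path id is:
    # the metadata id with "_" replaced by "-"
    return subject_id.replace("_", "-")

def cli_like_sanitize(subject_id: str) -> str:
    # hypothetical stand-in for dandi-cli's filename sanitization,
    # which apparently also replaces characters such as "%"
    return re.sub(r"[%_]", "-", subject_id)

meta_id = "0p8%-1p4%-CS-fly#-19"       # id stored in asset metadata
path_id = cli_like_sanitize(meta_id)   # id embedded in the file path
assert naive_norm(meta_id) != path_id  # the two ids never match,
# so the summary counts this animal twice: once from metadata, once from path
```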

satra commented 1 year ago

@yarikoptic - can you point me to the function in dandi-cli where the filename normalization occurs? as a start i will first simply copy the function to dandi-schema, or you could do it and replace the place where it is used.

this is where the mismatches are:

  1. from asset metadata record: https://github.com/dandi/dandi-schema/blob/b310e3e1f6745b7fb5df47daeaabadbc72a7f78d/dandischema/metadata.py#L302

  2. the other place we are getting subjects from is the asset path: https://github.com/dandi/dandi-schema/blob/b310e3e1f6745b7fb5df47daeaabadbc72a7f78d/dandischema/metadata.py#L314

the reason we do the last bit is that the bids-asset stream does not populate asset metadata properly with participant and other goodies from the bids structure.

yarikoptic commented 1 year ago

do you mean where those % could come from? isn't all the code involved in the aggregate_assets_summary call used above in dandi-schema, so the gotcha must be somewhere there?

but as for what creates filenames in dandi-cli -- AFAIK only organize and the code is at https://github.com/dandi/dandi-cli/blob/HEAD/dandi/organize.py#L68:5

> 2. the other place we are getting subjects from is the asset path: https://github.com/dandi/dandi-schema/blob/b310e3e1f6745b7fb5df47daeaabadbc72a7f78d/dandischema/metadata.py#L314
>
> the reason we do the last bit is that the bids-asset stream does not populate asset metadata properly with participant and other goodies from the bids structure.

that is where I thought (but didn't yet) to check/propose a fix -- not to go through all parts of the file path, but to look only at the top-level folder names, since with the above, an inconsistent dandiset with sub-1/sub-2.dat would result in 2 subjects.

satra commented 1 year ago

it's this function that is taking subject from metadata and sanitizing for a string in a filename: https://github.com/dandi/dandi-cli/blob/eef8443a16f0968891f4ddfca43d663df3f07f2b/dandi/organize.py#L392

i'll copy that over. that should at least temporarily fix the issue.

regarding reading the subject from a file path: this entire function assumes the dandiset is valid, as it's supposed to provide a summary of such a state. it can do certain things even with invalid dandisets, but i wouldn't consider the summary valid if the dandiset is invalid. that part was created when we did not have any bids validation in place, nor metadata extraction. so indeed cli should extract and populate relevant metadata from a bids asset path or other relevant files. but there are a few things this function is doing that should be done by the asset metadata extractor.
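A hedged sketch of what "copying the sanitizer over" could fix: run the metadata subject id through the same sanitization that produced the path id before adding either to the subject set, so the two sources deduplicate. `sanitize_value` here is a hypothetical stand-in for the dandi-cli function linked above, not its actual implementation:

```python
import re
from typing import Iterable

def sanitize_value(value: str) -> str:
    # hypothetical stand-in for dandi-cli's organize-time sanitizer;
    # assumed here to turn "%" and "_" into "-"
    return re.sub(r"[%_]", "-", value)

def collect_subjects(assets: Iterable[dict]) -> set[str]:
    """Gather subject ids with one shared normalization for both sources."""
    subjects: set[str] = set()
    for asset in assets:
        # source 1: the Participant identifier in the asset metadata,
        # passed through the same sanitizer that built the path
        for attr in asset.get("wasAttributedTo", []):
            if attr.get("schemaKey") == "Participant" and attr.get("identifier"):
                subjects.add(sanitize_value(attr["identifier"]))
        # source 2: a top-level sub-* component of the asset path
        # (fallback for assets whose metadata lacks Participant info)
        first = asset.get("path", "").split("/")[0]
        if first.startswith("sub-"):
            subjects.add(first[len("sub-"):].split("_")[0])
    return subjects
```

With both sources normalized identically, an asset whose metadata id is `0p8%-1p4%-CS-fly#-19` and whose path starts with `sub-0p8--1p4--CS-fly#-19` contributes a single subject instead of two.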

yarikoptic commented 1 year ago

> regarding reading the subject from a file path: this entire function assumes the dandiset is valid, as it's supposed to provide a summary of such a state. it can do certain things even with invalid dandisets, but i wouldn't consider the summary valid if the dandiset is invalid.

well, we generate summaries for invalid ones as well. The question is how robust we should be in such cases. And IMHO relying just on sub- at the top level sounds like the easiest and most robust approach.

yarikoptic commented 1 year ago

the only con (or maybe pro) -- we would not pick up subjects from within folders like derivatives/ etc, so some datasets of that kind would not be counted. So the "robustification" could be -- take the first subject indicator found in the path, not all of them ;)

yarikoptic commented 3 months ago

I don't think we have either fixed this issue or boiled it down.

looking at the current output -- sometimes the number is greater and sometimes lesser, but overall we have more inconsistencies reported than before!

```shell
dandi@drogon:/mnt/backup/dandi/dandisets$ for d in 0*; do ( cd $d; n1=$(awk '/numberOfSubjects:/{print $2}' dandiset.yaml || echo 0 ); n2=$(/bin/ls -1d sub-* 2>/dev/null | wc -l || echo 0); [ -n "$n1" ] && { [ "$n1" = "$n2" ] || echo "`pwd`: Mismatch $n1 != $n2"; } ); done
/mnt/backup/dandi/dandisets/000025: Mismatch 1 != 0
/mnt/backup/dandi/dandisets/000043: Mismatch 22 != 13
/mnt/backup/dandi/dandisets/000051: Mismatch 1 != 0
/mnt/backup/dandi/dandisets/000052: Mismatch 1 != 0
/mnt/backup/dandi/dandisets/000058: Mismatch 1 != 0
/mnt/backup/dandi/dandisets/000168: Mismatch 30 != 0
/mnt/backup/dandi/dandisets/000212: Mismatch 1097 != 1013
/mnt/backup/dandi/dandisets/000228: Mismatch 64 != 32
/mnt/backup/dandi/dandisets/000246: Mismatch 146 != 101
/mnt/backup/dandi/dandisets/000288: Mismatch 12 != 9
/mnt/backup/dandi/dandisets/000293: Mismatch 201 != 121
/mnt/backup/dandi/dandisets/000297: Mismatch 197 != 118
/mnt/backup/dandi/dandisets/000341: Mismatch 310 != 0
/mnt/backup/dandi/dandisets/000342: Mismatch 3 != 2
/mnt/backup/dandi/dandisets/000344: Mismatch 4 != 3
/mnt/backup/dandi/dandisets/000361: Mismatch 37 != 30
/mnt/backup/dandi/dandisets/000362: Mismatch 11 != 0
/mnt/backup/dandi/dandisets/000454: Mismatch 5 != 4
/mnt/backup/dandi/dandisets/000541: Mismatch 19 != 21
/mnt/backup/dandi/dandisets/000575: Mismatch 12 != 13
/mnt/backup/dandi/dandisets/000582: Mismatch 14 != 15
/mnt/backup/dandi/dandisets/000686: Mismatch 22 != 18
/mnt/backup/dandi/dandisets/000687: Mismatch 8 != 10
/mnt/backup/dandi/dandisets/000692: Mismatch 7 != 9
/mnt/backup/dandi/dandisets/000696: Mismatch 5 != 6
/mnt/backup/dandi/dandisets/000710: Mismatch 4 != 6
/mnt/backup/dandi/dandisets/000714: Mismatch 8 != 9
/mnt/backup/dandi/dandisets/000715: Mismatch 9 != 10
/mnt/backup/dandi/dandisets/000719: Mismatch 1 != 3
/mnt/backup/dandi/dandisets/000724: Mismatch 1 != 0
/mnt/backup/dandi/dandisets/000731: Mismatch 2 != 0
/mnt/backup/dandi/dandisets/000766: Mismatch 3 != 2
/mnt/backup/dandi/dandisets/000776: Mismatch 37 != 38
/mnt/backup/dandi/dandisets/000873: Mismatch 1 != 2
/mnt/backup/dandi/dandisets/000874: Mismatch 1 != 0
/mnt/backup/dandi/dandisets/000875: Mismatch 12 != 9
/mnt/backup/dandi/dandisets/000888: Mismatch 104 != 52
/mnt/backup/dandi/dandisets/000889: Mismatch 104 != 52
/mnt/backup/dandi/dandisets/000939: Mismatch 28 != 29
/mnt/backup/dandi/dandisets/000946: Mismatch 37 != 38
```

I improved the script into https://github.com/dandi/dandisets/blob/draft/tools/check_numberOfSubjects so it also reports the most recent published (MRP) version, which might not correspond to the draft, but is still a good hint that we assumed it was all kosher. And we have

dandi@drogon:/mnt/backup/dandi/dandisets$ tools/check_numberOfSubjects | grep -v 'version: null'
/mnt/backup/dandi/dandisets/000293: Mismatch 201 != 121   MRP version: "0.220708.1652"
/mnt/backup/dandi/dandisets/000454: Mismatch 5 != 4   MRP version: "0.230302.2331"
/mnt/backup/dandi/dandisets/000575: Mismatch 12 != 13   MRP version: "0.231010.1811"
/mnt/backup/dandi/dandisets/000692: Mismatch 7 != 9   MRP version: "0.240402.2118"
/mnt/backup/dandi/dandisets/000714: Mismatch 8 != 9   MRP version: "0.240402.2115"
/mnt/backup/dandi/dandisets/000939: Mismatch 28 != 29   MRP version: "0.240327.2229"

and for https://dandiarchive.org/dandiset/000293/draft we do not get any warnings etc. So I think we simply do not validate consistency between the "dandi-layout" filenames and the nwb metadata.

but overall it is indeed a "wild west" out there -- people seem simply not to care to make their dandisets "valid". I think we would benefit from some formalized way to provide feedback to the owners of all those dandisets which remain invalid and whose authors do not really concern themselves with it.

my attempt to figure out the reason on the sample /mnt/backup/dandi/dandisets/000293: Mismatch 201 != 121 -- checking whether there are somehow differing subject ids:

```shell
(dandisets-2) dandi@drogon:/mnt/backup/dandi/dandisets$ for sub in 000293/sub-*; do echo -ne "$sub\t"; dandi -l ERROR ls -f json -r --metadata all dandi://dandi/$sub/ | jq -r '.[] | .metadata.wasAttributedTo[] | select(.schemaKey == "Participant") | .identifier' | uniq ; done
000293/sub-1801-18129009	1801-18129009
000293/sub-1802-18201004	1802-18201004
000293/sub-1802-18201011	1802-18201011
000293/sub-1803-18220008	1803-18220008
000293/sub-1803-18220019	1803-18220019
000293/sub-1808-18320005	1808-18320005
000293/sub-1808-18320021	1808-18320021
000293/sub-1809-18320031	1809-18320031
000293/sub-1813-18329014	1813-18329014
000293/sub-1813-18329044	1813-18329044
000293/sub-1813-18329051	1813-18329051
000293/sub-1813-18329062	1813-18329062
000293/sub-1815-18426017	1815-18426017
000293/sub-1822-18o22001	1822-18o22001
000293/sub-1822-18o22010	1822-18o22010
000293/sub-1822-18o22020	1822-18o22020
000293/sub-1902-19128006	1902-19128006
000293/sub-1902-19128040	1902-19128040
000293/sub-1903-19129058	1903-19129058
000293/sub-1903-19129072	1903-19129072
000293/sub-1906-19228044	1906-19228044
000293/sub-1906-19228058	1906-19228058
000293/sub-1906-19228068	1906-19228068
000293/sub-1907-19319025	1907-19319025
000293/sub-1908-19320022	1908-19320022
000293/sub-1909-19328001	1909-19328001
000293/sub-1909-19328009	1909-19328009
000293/sub-1909-19328019	1909-19328019
000293/sub-1909-19328034	1909-19328034
000293/sub-1909-19328046	1909-19328046
000293/sub-1911-19o10010	1911-19o10010
000293/sub-1911-19o10045	1911-19o10045
000293/sub-1911-19o10054	1911-19o10054
000293/sub-1911-19o10065	1911-19o10065
000293/sub-1912-2019-11-04-0001	1912-2019_11_04_0001
000293/sub-1912-2019-11-04-0083	1912-2019_11_04_0083
000293/sub-1913-2019-11-26-0006	1913-2019_11_26_0006
000293/sub-1913-2019-11-26-0019	1913-2019_11_26_0019
000293/sub-1913-2019-11-26-0037	1913-2019_11_26_0037
000293/sub-1914-2019-11-28-0010	1914-2019_11_28_0010
000293/sub-1914-2019-11-28-0038	1914-2019_11_28_0038
000293/sub-X2015-10-08-15o08002	X2015.10.08-15o08002
000293/sub-X2015-10-08-15o08007	X2015.10.08-15o08007
000293/sub-X2015-10-08-15o08011	X2015.10.08-15o08011
000293/sub-X2015-10-08-15o08017	X2015.10.08-15o08017
000293/sub-X2015-10-08-15o08020	X2015.10.08-15o08020
000293/sub-X2015-10-08-15o08022	X2015.10.08-15o08022
000293/sub-X2015-10-08-15o08032	X2015.10.08-15o08032
000293/sub-X2015-10-08-15o08038	X2015.10.08-15o08038
000293/sub-X2015-11-09-2015-11-09-0003	X2015.11.09-2015_11_09_0003
000293/sub-X2015-11-09-2015-11-09-0017	X2015.11.09-2015_11_09_0017
000293/sub-X2015-11-09-2015-11-09-0053	X2015.11.09-2015_11_09_0053
000293/sub-X2015-11-09-2015-11-09-0078	X2015.11.09-2015_11_09_0078
000293/sub-X2015-11-09-2015-11-09-0106	X2015.11.09-2015_11_09_0106
000293/sub-X2015-11-09-2015-11-09-0107	X2015.11.09-2015_11_09_0107
000293/sub-X2016-01-28-2016-01-28-0004	X2016.01.28-2016_01_28_0004
000293/sub-X2016-01-28-2016-01-28-0012	X2016.01.28-2016_01_28_0012
000293/sub-X2016-01-28-2016-01-28-0019	X2016.01.28-2016_01_28_0019
000293/sub-X2016-02-04-2016-02-04-0009	X2016.02.04-2016_02_04_0009
000293/sub-X2016-02-04-2016-02-04-0015	X2016.02.04-2016_02_04_0015
000293/sub-X2016-02-04-2016-02-04-0018	X2016.02.04-2016_02_04_0018
000293/sub-X2016-02-04-2016-02-04-0021	X2016.02.04-2016_02_04_0021
000293/sub-X2016-02-04-2016-02-04-0029	X2016.02.04-2016_02_04_0029
000293/sub-X2016-02-04-2016-02-04-0033	X2016.02.04-2016_02_04_0033
000293/sub-X2016-02-04-2016-02-04-0042	X2016.02.04-2016_02_04_0042
000293/sub-X2016-02-04-2016-02-04-0045	X2016.02.04-2016_02_04_0045
000293/sub-X2016-02-25-2016-02-25-0005	X2016.02.25-2016_02_25_0005
000293/sub-X2016-02-25-2016-02-25-0007	X2016.02.25-2016_02_25_0007
000293/sub-X2016-02-25-2016-02-25-0073	X2016.02.25-2016_02_25_0073
000293/sub-X2016-02-25-2016-02-25-0082	X2016.02.25-2016_02_25_0082
000293/sub-X2016-02-25-2016-02-25-0134	X2016.02.25-2016_02_25_0134
000293/sub-X2016-02-25-2016-02-25-0255	X2016.02.25-2016_02_25_0255
000293/sub-X2016-02-29-2016-02-29-0032	X2016.02.29-2016_02_29_0032
000293/sub-X2016-02-29-2016-02-29-0065	X2016.02.29-2016_02_29_0065
000293/sub-X2016-03-01-2016-03-01-0000	X2016.03.01-2016_03_01_0000
000293/sub-X2016-03-01-2016-03-01-0047	X2016.03.01-2016_03_01_0047
000293/sub-X2016-03-03-2016-03-03-0002	X2016.03.03-2016_03_03_0002
000293/sub-X2016-03-03-2016-03-03-0054	X2016.03.03-2016_03_03_0054
000293/sub-X2016-03-03-2016-03-03-0100	X2016.03.03-2016_03_03_0100
000293/sub-X2016-03-03-2016-03-03-0103	X2016.03.03-2016_03_03_0103
000293/sub-X2019-01-22-19122003	X2019.01.22-19122003
000293/sub-X2019-01-22-19122017	X2019.01.22-19122017
000293/sub-X2019-01-22-19122024	X2019.01.22-19122024
000293/sub-X2019-01-22-19122026	X2019.01.22-19122026
000293/sub-X2019-01-28-19128003	X2019.01.28-19128003
000293/sub-X2019-01-28-19128061	X2019.01.28-19128061
000293/sub-X2019-01-28-19128068	X2019.01.28-19128068
000293/sub-X2019-01-29-19129004	X2019.01.29-19129004
000293/sub-X2019-01-29-19129015	....
```

so that relates to the discussion above on sanitization -- but it now leads to a double count whenever the metadata subject id differs from the filename-derived one, and that is a bug.
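From the listing above, the metadata ids appear to differ from the path ids only in characters that were replaced when the filename was generated ("_" and "." became "-"). A quick check of that hypothesis (the replacement set here is an assumption read off the 000293 listing, not the actual dandi-cli rule):

```python
import re

def normalize(subject_id: str) -> str:
    # assumption inferred from the 000293 listing: "_" and "." in the
    # metadata id were turned into "-" when the filename was generated
    return re.sub(r"[_.]", "-", subject_id)

# (path-derived id, metadata id) pairs from the listing above
pairs = [
    ("1912-2019-11-04-0001", "1912-2019_11_04_0001"),
    ("X2015-10-08-15o08002", "X2015.10.08-15o08002"),
    ("X2016-02-25-2016-02-25-0005", "X2016.02.25-2016_02_25_0005"),
]
for path_id, meta_id in pairs:
    assert normalize(meta_id) == path_id
# if the summary code applied one such normalization before deduplicating,
# these subjects would no longer be counted twice
```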