JULIELab / trec-pm

Support code and resources for participation at the TREC Precision Medicine Track (TREC-PM)
http://trec-cds.appspot.com
MIT License

Update `umlsSynsets.txt` MD5 #38

Closed michelole closed 5 years ago

michelole commented 5 years ago

@khituras Could you confirm the new MD5?

michelole commented 5 years ago

Here we come again... Sorting doesn't change the file for me.

$ du umlsSynsets.txt
671872  umlsSynsets.txt
$ head -1 umlsSynsets.txt
C0000005     (131)I-MAA (131)I-Macroaggregated Albumin

khituras commented 5 years ago

Your file is much larger than mine:

faessler@dawkins:~/Coding/git/trec-pm/resources$ du umlsSynsets.txt
330524 umlsSynsets.txt
faessler@dawkins:~/Coding/git/trec-pm/resources$ head umlsSynsets.txt
C0000005 (131)I-MAA (131)I-Macroaggregated Albumin

Let's re-check our MRCONSO.RRF files. I got a new one from the full download I did, but it should be the same as before:

faessler@h5:/data/data_resources/UMLS/UMLS2019/2019AA/META$ wc -l MRCONSO.RRF
14608809 MRCONSO.RRF
faessler@h5:/data/data_resources/UMLS/UMLS2019/2019AA/META$ du MRCONSO.RRF
1816888 MRCONSO.RRF
faessler@h5:/data/data_resources/UMLS/UMLS2019/2019AA/META$ md5sum MRCONSO.RRF
fd34e41376441439c37d89105a73c6b5 MRCONSO.RRF


michelole commented 5 years ago

Just realized that du is not a proper tool to compare sizes because the cluster size on APFS and ext4 may be different.

What about number of bytes as reported by wc?

✗ wc -c umlsSynsets.txt
 338449057 umlsSynsets.txt
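
A minimal sketch of the comparison suggested here, assuming a POSIX shell. `du` reports allocated disk blocks, and both its default block unit (512 bytes under POSIX and BSD/macOS, 1024 bytes in GNU `du`) and the filesystem's allocation can differ between machines, whereas `wc -c` counts the bytes of content. The helper name `byte_size` and the throwaway sample file are made up for illustration.

```shell
#!/bin/sh
# Hypothetical helper: print the exact size of a file in bytes.
# `wc -c` counts bytes of content, so the result does not depend on the
# filesystem or on the block unit that `du` happens to use.
byte_size() {
    # Reading from stdin keeps the file name out of the output;
    # `tr` strips the padding that some `wc` implementations add.
    wc -c < "$1" | tr -d '[:space:]'
}

# Illustration with a throwaway file (not the real umlsSynsets.txt).
sample=$(mktemp)
printf 'C0000005\n' > "$sample"

echo "bytes:  $(byte_size "$sample")"
# `du` reports allocated blocks instead; this number varies by platform.
echo "blocks: $(du "$sample" | cut -f1)"

rm -f "$sample"
```

Running `byte_size umlsSynsets.txt` on both machines gives directly comparable numbers, which the raw `du` output does not.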

Source file seems to be OK:

➜  umls wc -l MRCONSO.RRF
 14608809 MRCONSO.RRF
➜  umls du MRCONSO.RRF
3635256 MRCONSO.RRF
➜  umls wc -c MRCONSO.RRF 
 1860486173 MRCONSO.RRF
➜  umls md5sum MRCONSO.RRF 
fd34e41376441439c37d89105a73c6b5  MRCONSO.RRF

khituras commented 5 years ago

I see XD

faessler@dawkins:~/Coding/git/trec-pm/resources$ wc -c umlsSynsets.txt
338449057 umlsSynsets.txt

So this seems fine.
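
The digest check used throughout this thread can be wrapped in a small script so that both sides run exactly the same comparison. This is only a sketch under stated assumptions: the helper names `md5_of` and `verify_md5` are hypothetical, and the fallback to `md5 -q` assumes a BSD/macOS system without GNU coreutils installed.

```shell
#!/bin/sh
# Hypothetical helper: print only the MD5 hex digest of a file.
md5_of() {
    if command -v md5sum >/dev/null 2>&1; then
        md5sum "$1" | cut -d' ' -f1      # GNU coreutils
    else
        md5 -q "$1"                      # BSD/macOS fallback (assumption)
    fi
}

# Hypothetical helper: succeed only if the file's digest equals the
# expected one passed as the second argument.
verify_md5() {
    [ "$(md5_of "$1")" = "$2" ]
}

# Example call with the MRCONSO.RRF digest both participants reported:
# verify_md5 MRCONSO.RRF fd34e41376441439c37d89105a73c6b5 && echo "match"
```

A matching digest for `MRCONSO.RRF` together with an identical `wc -c` count for `umlsSynsets.txt` confirms only that the input and the output size agree; the content of `umlsSynsets.txt` would still need its own digest comparison.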
