michelole closed 5 years ago
Here we come again... Sorting doesn't change the file for me.
$ du umlsSynsets.txt
671872 umlsSynsets.txt
$ head -1 umlsSynsets.txt
C0000005 (131)I-MAA (131)I-Macroaggregated Albumin
Your file is much larger than mine:
faessler@dawkins:~/Coding/git/trec-pm/resources$ du umlsSynsets.txt
330524 umlsSynsets.txt
faessler@dawkins:~/Coding/git/trec-pm/resources$ head umlsSynsets.txt
C0000005 (131)I-MAA (131)I-Macroaggregated Albumin
Let's re-check our MRCONSO.RRF files. I got a new one from the full download I did, but it should be the same as before:
faessler@h5:/data/data_resources/UMLS/UMLS2019/2019AA/META$ wc -l MRCONSO.RRF
14608809 MRCONSO.RRF
faessler@h5:/data/data_resources/UMLS/UMLS2019/2019AA/META$ du MRCONSO.RRF
1816888 MRCONSO.RRF
faessler@h5:/data/data_resources/UMLS/UMLS2019/2019AA/META$ md5sum MRCONSO.RRF
fd34e41376441439c37d89105a73c6b5 MRCONSO.RRF
On 18. Jul 2019, at 10:15, Michel Oleynik notifications@github.com wrote:
Here we come again... Sorting doesn't change the file for me.
$ du umlsSynsets.txt
671872 umlsSynsets.txt
$ head -1 umlsSynsets.txt
C0000005 (131)I-MAA (131)I-Macroaggregated Albumin
Just realized that du is not a proper tool to compare sizes because the cluster size on APFS and ext4 may be different.
What about number of bytes as reported by wc?
✗ wc -c umlsSynsets.txt
338449057 umlsSynsets.txt
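The two du figures differ by almost exactly a factor of two, which is what one would expect if the two tools count in different block units, possibly in addition to any difference in filesystem cluster size. A rough check, assuming macOS du counts 512-byte blocks and GNU du counts 1024-byte blocks (their usual defaults; neither setting was confirmed in this thread). `blocks_to_bytes` is a hypothetical helper written for this sketch:

```shell
# Convert a du block count to bytes.
# Usage: blocks_to_bytes BLOCK_COUNT BLOCK_SIZE_IN_BYTES
blocks_to_bytes() {
  echo $(( $1 * $2 ))
}

# Assumption: macOS du reports 512-byte blocks, GNU du reports 1024-byte blocks.
mac_bytes=$(blocks_to_bytes 671872 512)     # 343998464
linux_bytes=$(blocks_to_bytes 330524 1024)  # 338456576

echo "macOS du: $mac_bytes bytes allocated"
echo "Linux du: $linux_bytes bytes allocated"
echo "wc -c:    338449057 bytes of content"
```

Under those assumptions both allocations sit slightly above the 338449057 bytes of content, so the du numbers are consistent with identical files. Comparing `wc -c` output or checksums avoids the unit question entirely.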
Source file seems to be OK:
➜ umls wc -l MRCONSO.RRF
14608809 MRCONSO.RRF
➜ umls du MRCONSO.RRF
3635256 MRCONSO.RRF
➜ umls wc -c MRCONSO.RRF
1860486173 MRCONSO.RRF
➜ umls md5sum MRCONSO.RRF
fd34e41376441439c37d89105a73c6b5 MRCONSO.RRF
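Since the checksum is the only comparison here that does not depend on the filesystem, it could be scripted so both sides run the same check. A minimal sketch, assuming `md5sum` from GNU coreutils is on the PATH (as in the prompts above); `verify_md5` is a hypothetical helper, not something from the repository:

```shell
# Return success if FILE has the expected MD5 checksum.
# Usage: verify_md5 FILE EXPECTED_HEX
verify_md5() {
  # md5sum prints "<hash>  <filename>"; keep only the hash.
  actual=$(md5sum "$1" | cut -d ' ' -f 1)
  [ "$actual" = "$2" ]
}

# Only run the check when the file is present in the current directory.
if [ -f MRCONSO.RRF ]; then
  if verify_md5 MRCONSO.RRF fd34e41376441439c37d89105a73c6b5; then
    echo "MRCONSO.RRF: checksum matches"
  else
    echo "MRCONSO.RRF: checksum MISMATCH"
  fi
fi
```

On a macOS machine without coreutils, the native `md5 -q FILE` prints the same hash and could replace the `md5sum | cut` pipeline.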
I see XD
faessler@dawkins:~/Coding/git/trec-pm/resources$ wc -c umlsSynsets.txt
338449057 umlsSynsets.txt
So this seems fine.
On 18. Jul 2019, at 10:32, Michel Oleynik notifications@github.com wrote:
Just realized that du is not a proper tool to compare sizes because the cluster size on APFS and ext4 may be different.
What about number of bytes as reported by wc?
✗ wc -c umlsSynsets.txt
338449057 umlsSynsets.txt
Source file seems to be OK:
➜ umls wc -l MRCONSO.RRF
14608809 MRCONSO.RRF
➜ umls du MRCONSO.RRF
3635256 MRCONSO.RRF
➜ umls wc -c MRCONSO.RRF
1860486173 MRCONSO.RRF
➜ umls md5sum MRCONSO.RRF
fd34e41376441439c37d89105a73c6b5 MRCONSO.RRF
@khituras Could you confirm the new MD5?