**Open** · tecamenz opened this issue 10 months ago
Thanks for pointing this out. This is not a bug in wfdb-python, it's a bug in the database.
The RECORDS file is (probably) correct; the set of files on PhysioNet is wrong. It looks like some of the files are present in mimic3wdb but were not properly linked into the mimic3wdb-matched directory.
(One may also ask why on earth this record is split into over 10000 tiny segments. I have no idea.)
We are trying to download the mimic3wdb-matched database via `wfdb.io.dl_database`, like so:

```python
wfdb.io.dl_database("mimic3wdb-matched", "mimic3wdb-matched", records='all', annotators='all', keep_subdirs=True, overwrite=False)
```
After a long wait, we get an error indicating a missing file:
```
wfdb.io._url.NetFileNotFoundError: 404 Error: Not Found for url: https://physionet.org/files/mimic3wdb-matched/1.0/p01/p017488/3783537_10000.hea
```
While investigating, we found that the corresponding RECORDS file lists more records than are actually present in the database: https://physionet.org/files/mimic3wdb-matched/1.0/p01/p017488/RECORDS
RECORDS file: ![image](https://github.com/MIT-LCP/wfdb-python/assets/52130886/4c40add3-3d47-4843-8a7f-ca9091a088f8)

Actual content: ![image](https://github.com/MIT-LCP/wfdb-python/assets/52130886/c84d38d2-789f-4a7a-8f47-7a1af0da7fed)
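The mismatch above can also be detected programmatically. A minimal sketch (the function name `find_missing_records` is hypothetical, not part of wfdb) that compares the entries of a RECORDS file against the files actually present in the directory:

```python
def find_missing_records(records_lines, present_files):
    """Return RECORDS entries whose .hea header file is absent.

    records_lines: iterable of record names read from the RECORDS file.
    present_files: set of filenames actually available in the directory.
    """
    missing = []
    for name in records_lines:
        name = name.strip()
        if name and f"{name}.hea" not in present_files:
            missing.append(name)
    return missing

# Example with the record from the traceback: 3783537_10000 is listed
# in RECORDS, but its header file is not on the server.
listed = ["3783537_0001", "3783537_10000"]
present = {"3783537_0001.hea", "3783537_0001.dat"}
print(find_missing_records(listed, present))  # ['3783537_10000']
```

Running such a check before downloading would let a client skip (or report) the dead entries instead of failing partway through.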
`wfdb.io.dl_database` generates the download URLs from this RECORDS file, which then leads to the error mentioned above.

Some questions:
`wfdb.io.dl_database` is excruciatingly slow. Would it make sense to rewrite `wfdb.io.dl_database` to use multi-threading? Or what approach do you use to dump the whole database efficiently?
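For what it's worth, a generic multi-threaded fetch loop can be sketched with the standard library. This is not wfdb's API: `fetch` here is a caller-supplied function (e.g. one wrapping `urllib.request.urlretrieve` or wfdb's own per-file downloader), and collecting failures instead of aborting would also sidestep the 404 on the broken record:

```python
import concurrent.futures

def download_all(record_urls, fetch, max_workers=8):
    """Fetch many files concurrently, collecting failures
    (e.g. 404s for RECORDS entries missing from the server)
    instead of aborting the whole download on the first error.
    """
    failures = []
    with concurrent.futures.ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {pool.submit(fetch, url): url for url in record_urls}
        for fut in concurrent.futures.as_completed(futures):
            url = futures[fut]
            try:
                fut.result()  # re-raises any exception from fetch
            except Exception as exc:
                failures.append((url, exc))
    return failures
```

Since the downloads are I/O-bound, threads (rather than processes) should be enough to saturate the connection; the failure list can then be logged or retried at the end.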