MIT-LCP / wfdb-python

Native Python WFDB package

mimic3wdb-matched RECORDS file has too many entries #466

Open tecamenz opened 10 months ago

tecamenz commented 10 months ago

We are trying to download the mimic3wdb-matched database via wfdb.io.dl_database like so:

wfdb.io.dl_database("mimic3wdb-matched", "mimic3wdb-matched", records='all', annotators='all', keep_subdirs=True, overwrite=False)

After a long wait, we get an error indicating a missing file: wfdb.io._url.NetFileNotFoundError: 404 Error: Not Found for url: https://physionet.org/files/mimic3wdb-matched/1.0/p01/p017488/3783537_10000.hea

While investigating, we found that the corresponding RECORDS file lists more records than actually exist in the database directory: https://physionet.org/files/mimic3wdb-matched/1.0/p01/p017488/RECORDS

RECORDS file: image

Actual content: image

wfdb.io.dl_database builds the download URLs from this RECORDS file, so every entry without a matching file on the server leads to the 404 error shown above.
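To find every listed record whose header is missing, rather than hitting them one at a time, something like the following could be used. This is only a minimal sketch using the standard library, not how wfdb does it internally. It assumes the RECORDS file holds one record name per line, relative to its own directory, and that each record has a `.hea` file next to it. The names `header_urls`, `url_exists` and `missing_headers` are made up for this example.

```python
import urllib.error
import urllib.request


def header_urls(base_url, records_text):
    """Build one .hea URL per non-empty line of a RECORDS listing."""
    base = base_url.rstrip("/")
    names = [line.strip() for line in records_text.splitlines() if line.strip()]
    return [f"{base}/{name}.hea" for name in names]


def url_exists(url, timeout=30):
    """Return True if a HEAD request succeeds, False on HTTP 404.

    Any other HTTP error is re-raised so it is not mistaken for a missing file.
    """
    request = urllib.request.Request(url, method="HEAD")
    try:
        with urllib.request.urlopen(request, timeout=timeout):
            return True
    except urllib.error.HTTPError as err:
        if err.code == 404:
            return False
        raise


def missing_headers(base_url, records_text, exists=url_exists):
    """Return the header URLs listed in RECORDS that the server does not have."""
    return [url for url in header_urls(base_url, records_text) if not exists(url)]
```

Passing the text of the RECORDS file and the directory URL (here https://physionet.org/files/mimic3wdb-matched/1.0/p01/p017488/) to `missing_headers` would return the URLs that answer with 404. The `exists` argument is injectable so the logic can be tried without network access.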

Some questions:

  1. Can someone update the RECORDS file so that it reflects the actual database content?
  2. The download via wfdb.io.dl_database is excruciatingly slow. Would it make sense to rewrite wfdb.io.dl_database to use multi-threading? Or what approach do you use to dump the whole database efficiently?
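
To illustrate what the multi-threading question has in mind, here is a rough sketch of a thread-pool download that records 404s instead of aborting on the first one. It is not the wfdb implementation and uses only the standard library. `fetch_one` and `fetch_all` are hypothetical names, the `(url, dest)` pairs are assumed to be prepared beforehand, and the worker count of 8 is an arbitrary choice.

```python
import urllib.error
import urllib.request
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path


def fetch_one(url, dest):
    """Download url to dest; return "ok", "skipped", or "missing" (HTTP 404)."""
    dest = Path(dest)
    if dest.exists():
        return "skipped"  # same intent as overwrite=False
    dest.parent.mkdir(parents=True, exist_ok=True)
    try:
        urllib.request.urlretrieve(url, dest)
    except urllib.error.HTTPError as err:
        if err.code == 404:
            return "missing"  # listed in RECORDS but not on the server
        raise
    return "ok"


def fetch_all(jobs, fetch=fetch_one, max_workers=8):
    """Run fetch(url, dest) for every (url, dest) pair on a thread pool.

    Returns a dict mapping each URL to the status string returned by fetch.
    """
    jobs = list(jobs)
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        statuses = pool.map(lambda job: fetch(*job), jobs)
        return {url: status for (url, _), status in zip(jobs, statuses)}
```

Filtering the returned dict for `"missing"` then gives the list of entries that exist in RECORDS but not on the server, and the rest of the download still completes.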

bemoody commented 9 months ago

Thanks for pointing this out. This is not a bug in wfdb-python, it's a bug in the database.

The RECORDS file is (probably) correct; the set of files on PhysioNet is wrong. It looks like some of the files are present in mimic3wdb but were not properly linked into the mimic3wdb-matched directory.

(One may also ask why on earth this record is split into over 10000 tiny segments. I have no idea.)