Open tompollard opened 1 year ago
Sure! Let's talk specifically about what this package needs.
Currently, https://physionet.org/api/v1/published
is almost half a megabyte (and /rest/database-list/
is only a little smaller.) I don't want to download all of that when all I need is a list of projects, version numbers, and maybe titles.
get_dbs
is one issue, but the larger issue in my mind is the ordinary record I/O functions like rdrecord
. I feel rather strongly that these functions should be tied to semantic versions. For example, I should be able to write some code to say "read record 83411188 from mimic4wdb version 1"; that should automatically fetch the highest-numbered version in the 1.x series, but not the latest version in the 2.x series.
At present, you can pass an exact version number in pn_dir
. If you don't pass an exact version number, then wfdb-python uses a very ugly kludge to fetch the "latest published" version from PhysioNet.
Better would be to say something like:
The caller can pass a database name (like "mimic4wdb"
) and a version number (like "1.0"
); this should retrieve either version 1.0
or the highest-numbered version starting with 1.0.
.
For backward compatibility, you can just pass a database name, and the version number is set to "1"
by default.
I also think that the caller should be able to explicitly specify the data repository (PhysioNet, Health Data Nexus, etc.) when calling rdrecord
- you shouldn't have to mess with package "global settings" to switch between repositories.
And it would be preferable if you could specify the repository and database name and version and record name all as a single string (e.g. a URL.)
Would love thoughts from @cx1111 and @ikarosilva if you have any :)
The caller can pass a database name (like "mimic4wdb") and a version number (like "1.0"); this should retrieve either version 1.0 or the highest-numbered version starting with 1.0..
I think there needs to be some kind of option to just get "latest". In many cases I think people will just want the biggest number.
And it would be preferable if you could specify the repository and database name and version and record name all as a single string (e.g. a URL.)
Alternatively, instead of repository + database name + (exact) version, maybe use a DOI.
I think there needs to be some kind of option to just get "latest". In many cases I think people will just want the biggest number.
Right, which I think is the problem. You write some code that says "use the latest version of MIMIC-IV" when the latest version of MIMIC-IV is 1.0, and later version 2.0 appears, and people trying to run your published code find that it doesn't work.
Right, which I think is the problem. You write some code that says "use the latest version of MIMIC-IV" when the latest version of MIMIC-IV is 1.0, and later version 2.0 appears, and people trying to run your published code find that it doesn't work.
Got it, I understand your concerns about "latest" now. Makes sense. My point then is just that we should make it easy to discover the latest version of a dataset.
My point then is just that we should make it easy to discover the latest version of a dataset.
Agreed!
I like this discussion because it is a concrete example of why the semantic versioning for datasets is useful. If changes to a dataset would break a tool like WFDB (or the underlying analysis) then it's a major change.
Does this have something to do with the fact that
This line of code from the tutorial notebook: https://github.com/MIT-LCP/wfdb-python/blob/main/demo.ipynb
"record2 = wfdb.rdrecord('a103l', pn_dir='challenge-2015/training/')"
now returns a HTTPERROR?
HTTPError: 404 Client Error: Not Found for url: https://physionet.org/files/challenge-2015/training/a103l.hea
Actually any database I'm trying to access is now returning this error.
@IsmailElnaggar we haven't yet deprecated the old API, so I don't think so, but it is definitely related.
The HTTP ERROR that's being returned by rdrecord
looks incorrect to me. It is missing the version number. The correct URL for the file is: https://physionet.org/files/challenge-2015/1.0.0/training/a103l.hea.
You could fix the issue by adding the version number to your path. e.g.:
record = wfdb.rdrecord('a103l', pn_dir="challenge-2015/1.0.0/training/")
I think really this is a bug in rdrecord
though. We should probably require version to be specified as an argument. @bemoody, any thoughts?
@tompollard Thank you for your fast reply.
I was actually trying to access a different db, the Long Term AF Database but was confused as to why my old code I hadn't touched in over a year wasn't working anymore.. Now that I added the version number it's working again thank you.
The code below now works.
record_number= "01" record = wfdb.rdrecord(record_number, pn_dir="ltafdb/1.0.0/")
Thanks @IsmailElnaggar, glad to hear it! We will likely make some improvements to rdrecord
to avoid confusion in the future (plus fix the example in the notebook!).
Looks like the HTML format was changed at some point. So yes, PhysioNet broke it.
Trying to extract version number from the HTML was always a bad idea & that's why I want a real version list API. That said, the wfdb-python API could do with a lot of improvement.
Trying to extract version number from the HTML was always a bad idea & that's why I want a real version list API.
We now serve versions at (e.g.): https://physionet.org/api/v1/project/published/mimiciv/
Is this what you are looking for? Or are you asking for a single endpoint that returns all versions of all projects? e.g. something like https://physionet.org/api/v1/project/published/ but with version details.
We now serve versions at (e.g.): https://physionet.org/api/v1/project/published/mimiciv/
Yeah, we could use that. What I don't like about that is that it's a huge amount of information we don't need (HTML abstracts & titles & other redundant info for every single published version.) I'd rather be able to just ask for a list of versions. But yes, that's better than trying to parse /content/.
Need to make some changes to the WFDB-Python API in any event, & I've been putting off doing that.
@IsmailElnaggar You should find that your original code now works again, thanks to a temporary fix by @bemoody (see: https://github.com/MIT-LCP/physionet-build/pull/1873).
Benjamin pointed out some kludgy code in the WFDB package that parses the HTML project pages on PhysioNet. We'll need to improve it at some point!
@tompollard @bemoody Thanks a lot for looking into this so quickly and getting a fix up. WFDB is such a great tool and now I see why!
As discussed in https://github.com/MIT-LCP/physionet-build/pull/1697, we plan to deprecate the
database-list
API on PhysioNet, in favour of a new API served from:/api/v1/
. If/when https://physionet.org/rest/database-list/ is deprecated, the following function will break:https://github.com/MIT-LCP/wfdb-python/blob/14248d17e87f3392ab87a30dc03013233414a3db/wfdb/io/download.py#L181-L212
When PhysioNet releases its new API (most likely at
https://physionet.org/api/v1/
), we should migrate theget_dbs
to the new API (or perhaps deprecateget_dbs()
and replace this with a new function).