MIT-LCP / wfdb-python

Native Python WFDB package
MIT License
745 stars 302 forks source link

https://physionet.org/rest/database-list/ is being deprecated #439

Open tompollard opened 1 year ago

tompollard commented 1 year ago

As discussed in https://github.com/MIT-LCP/physionet-build/pull/1697, we plan to deprecate the database-list API on PhysioNet, in favour of a new API served from: /api/v1/. If/when https://physionet.org/rest/database-list/ is deprecated, the following function will break:

https://github.com/MIT-LCP/wfdb-python/blob/14248d17e87f3392ab87a30dc03013233414a3db/wfdb/io/download.py#L181-L212

When PhysioNet releases its new API (most likely at https://physionet.org/api/v1/), we should migrate the get_dbs to the new API (or perhaps deprecate get_dbs() and replace this with a new function).

bemoody commented 1 year ago

Sure! Let's talk specifically about what this package needs.

Currently, https://physionet.org/api/v1/published is almost half a megabyte (and /rest/database-list/ is only a little smaller.) I don't want to download all of that when all I need is a list of projects, version numbers, and maybe titles.

get_dbs is one issue, but the larger issue in my mind is the ordinary record I/O functions like rdrecord. I feel rather strongly that these functions should be tied to semantic versions. For example, I should be able to write some code to say "read record 83411188 from mimic4wdb version 1"; that should automatically fetch the highest-numbered version in the 1.x series, but not the latest version in the 2.x series.

At present, you can pass an exact version number in pn_dir. If you don't pass an exact version number, then wfdb-python uses a very ugly kludge to fetch the "latest published" version from PhysioNet.

Better would be to say something like:

I also think that the caller should be able to explicitly specify the data repository (PhysioNet, Health Data Nexus, etc.) when calling rdrecord - you shouldn't have to mess with package "global settings" to switch between repositories.

And it would be preferable if you could specify the repository and database name and version and record name all as a single string (e.g. a URL.)

Would love thoughts from @cx1111 and @ikarosilva if you have any :)

tompollard commented 1 year ago

The caller can pass a database name (like "mimic4wdb") and a version number (like "1.0"); this should retrieve either version 1.0 or the highest-numbered version starting with 1.0..

I think there needs to be some kind of option to just get "latest". In many cases I think people will just want the biggest number.

bemoody commented 1 year ago

And it would be preferable if you could specify the repository and database name and version and record name all as a single string (e.g. a URL.)

Alternatively, instead of repository + database name + (exact) version, maybe use a DOI.

bemoody commented 1 year ago

I think there needs to be some kind of option to just get "latest". In many cases I think people will just want the biggest number.

Right, which I think is the problem. You write some code that says "use the latest version of MIMIC-IV" when the latest version of MIMIC-IV is 1.0, and later version 2.0 appears, and people trying to run your published code find that it doesn't work.

tompollard commented 1 year ago

Right, which I think is the problem. You write some code that says "use the latest version of MIMIC-IV" when the latest version of MIMIC-IV is 1.0, and later version 2.0 appears, and people trying to run your published code find that it doesn't work.

Got it, I understand your concerns about "latest" now. Makes sense. My point then is just that we should make it easy to discover the latest version of a dataset.

bemoody commented 1 year ago

My point then is just that we should make it easy to discover the latest version of a dataset.

Agreed!

tompollard commented 1 year ago

I like this discussion because it is a concrete example of why the semantic versioning for datasets is useful. If changes to a dataset would break a tool like WFDB (or the underlying analysis) then it's a major change.

IsmailElnaggar commented 1 year ago

Does this have something to do with the fact that

This line of code from the tutorial notebook: https://github.com/MIT-LCP/wfdb-python/blob/main/demo.ipynb

"record2 = wfdb.rdrecord('a103l', pn_dir='challenge-2015/training/')"

now returns a HTTPERROR?

HTTPError: 404 Client Error: Not Found for url: https://physionet.org/files/challenge-2015/training/a103l.hea

Actually any database I'm trying to access is now returning this error.

tompollard commented 1 year ago

@IsmailElnaggar we haven't yet deprecated the old API, so I don't think so, but it is definitely related.

The HTTP ERROR that's being returned by rdrecord looks incorrect to me. It is missing the version number. The correct URL for the file is: https://physionet.org/files/challenge-2015/1.0.0/training/a103l.hea.

You could fix the issue by adding the version number to your path. e.g.:

record = wfdb.rdrecord('a103l', pn_dir="challenge-2015/1.0.0/training/")

I think really this is a bug in rdrecord though. We should probably require version to be specified as an argument. @bemoody, any thoughts?

IsmailElnaggar commented 1 year ago

@tompollard Thank you for your fast reply.

I was actually trying to access a different db, the Long Term AF Database but was confused as to why my old code I hadn't touched in over a year wasn't working anymore.. Now that I added the version number it's working again thank you.

The code below now works.

record_number= "01" record = wfdb.rdrecord(record_number, pn_dir="ltafdb/1.0.0/")

tompollard commented 1 year ago

Thanks @IsmailElnaggar, glad to hear it! We will likely make some improvements to rdrecord to avoid confusion in the future (plus fix the example in the notebook!).

bemoody commented 1 year ago

Looks like the HTML format was changed at some point. So yes, PhysioNet broke it.

Trying to extract version number from the HTML was always a bad idea & that's why I want a real version list API. That said, the wfdb-python API could do with a lot of improvement.

tompollard commented 1 year ago

Trying to extract version number from the HTML was always a bad idea & that's why I want a real version list API.

We now serve versions at (e.g.): https://physionet.org/api/v1/project/published/mimiciv/

Is this what you are looking for? Or are you asking for a single endpoint that returns all versions of all projects? e.g. something like https://physionet.org/api/v1/project/published/ but with version details.

bemoody commented 1 year ago

We now serve versions at (e.g.): https://physionet.org/api/v1/project/published/mimiciv/

Yeah, we could use that. What I don't like about that is that it's a huge amount of information we don't need (HTML abstracts & titles & other redundant info for every single published version.) I'd rather be able to just ask for a list of versions. But yes, that's better than trying to parse /content/.

Need to make some changes to the WFDB-Python API in any event, & I've been putting off doing that.

tompollard commented 1 year ago

@IsmailElnaggar You should find that your original code now works again, thanks to a temporary fix by @bemoody (see: https://github.com/MIT-LCP/physionet-build/pull/1873).

Benjamin pointed out some kludgy code in the WFDB package that parses the HTML project pages on PhysioNet. We'll need to improve it at some point!

IsmailElnaggar commented 1 year ago

@tompollard @bemoody Thanks a lot for looking into this so quickly and getting a fix up. WFDB is such a great tool and now I see why!