MIT-LCP / physionet-build

The new PhysioNet platform.
https://physionet.org/
BSD 3-Clause "New" or "Revised" License

File list pagination #1917

Open cxgoogle opened 1 year ago

cxgoogle commented 1 year ago

The project page's files panel doesn't paginate the file list. This is a problem for projects with a large number of files.

To avoid generating gigantic HTML pages that would crash a user's browser, we make grouping subdirectories like p01 which have no useful meaning.

A flat file hierarchy like this is much simpler to use than, for example, the MIMIC CXR file layout.

This is a pretty important missing feature that's actively influencing how database files are being published.

bemoody commented 1 year ago

Hi, cxgoogle!

What exactly did you mean to point to? Is http://bigstore/ some internal Google thing?

> we make grouping subdirectories like p01 which have no useful meaning.

Even if I accepted the premise that the subdirectory names are meaningless, I'm not seeing how "pages" are any less so.

cxgoogle commented 1 year ago

Sorry, I fixed the link: https://console.cloud.google.com/storage/browser/gcs-public-data--healthcare-nih-chest-xray/png

The pages would only be shown in the browser UI (or a list files API). It would only be used to support browsing a large flat file hierarchy.
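For illustration only, here is a minimal sketch of what paging over a flat listing could look like. The function name `paginate` and the default page size are assumptions made for this example; nothing here is taken from the physionet-build code.

```python
import math


def paginate(names, page, per_page=100):
    """Return (items, total_pages) for a 1-based page number.

    Hypothetical helper: `names` is any iterable of file names in one
    flat directory. Out-of-range page numbers are clamped.
    """
    if per_page < 1:
        raise ValueError("per_page must be at least 1")
    ordered = sorted(names)
    # An empty directory still has one (empty) page.
    total_pages = max(1, math.ceil(len(ordered) / per_page))
    page = min(max(page, 1), total_pages)
    start = (page - 1) * per_page
    return ordered[start:start + per_page], total_pages


if __name__ == "__main__":
    files = [f"{i:08d}_0000.png" for i in range(250)]
    items, total = paginate(files, page=3)
    print(total, len(items), items[0])
```

The paging only affects what is displayed; the files themselves stay in a single directory.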

We don't need to discuss the particulars of the MIMIC dataset file grouping, by patient ID, or other. I just think that a database creator/contributor should not have to deliberately group their files into subfolders to publish their large dataset on PhysioNet.

bemoody commented 1 year ago

Thanks. It is a valid concern and I do understand that allowing more flexibility in file structures could be beneficial both for data producers and data consumers, and might seem more elegant.

At the same time, a key principle of PhysioNet is that the databases are not tied to PhysioNet software or Google software or anybody else's software. We want databases to be accessible and usable in practice with a wide range of standard software (OSes, filesystems, servers, clients, applications...), partly because we ourselves don't want to implement our own bespoke tools if we don't have to.

Sure, we could publish a database with millions of files in a flat directory. But it wouldn't be comfortable for anybody to work with in that form.

It wouldn't be comfortable for us - we'd need some mechanism to cache the partial directory listings.

It wouldn't be comfortable for anyone who wants to mirror the database - they'd have to use our custom server software rather than a standard web server.

It wouldn't be comfortable for people who want to download subsets of the database, or download the entire database by splitting it up over multiple days or multiple machines.

It wouldn't be any easier to browse online than putting the data in subdirectories. (It does occur to me that we could add "previous"/"next" buttons to the files panel - that's a nice thing we could do that wouldn't cost much.)

It would be miserable to browse or manipulate with standard command-line tools, and probably most GUI file browsers too.
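As a rough sketch of the "previous"/"next" idea mentioned above, the neighbours of the current subdirectory can be looked up among its sorted siblings. The function name `neighbors` and the sample directory names are hypothetical, not part of the existing files panel.

```python
def neighbors(siblings, current):
    """Return (previous, next) entries around `current`, or None at either end.

    Hypothetical helper: `siblings` is the list of entry names in the
    parent directory, and `current` must be one of them.
    """
    ordered = sorted(siblings)
    i = ordered.index(current)  # raises ValueError if current is missing
    previous = ordered[i - 1] if i > 0 else None
    following = ordered[i + 1] if i < len(ordered) - 1 else None
    return previous, following


if __name__ == "__main__":
    print(neighbors(["p12", "p10", "p11"], "p11"))
```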

You can argue that a flat directory might be a little bit more convenient for the consumer application (assuming it's using some suitably designed storage)... but when the filename is an opaque identifier, calling it "0000/0001_0000.png" or "0000/00000001_0000.png" isn't really any better or worse than "00000001_0000.png".
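To make the equivalence concrete, here is a small sketch that maps between the flat name and the "0000/00000001_0000.png" form. The function names and the four-character prefix length are assumptions chosen to match the example names above, not an established PhysioNet convention.

```python
from pathlib import PurePosixPath


def sharded_path(filename, prefix_len=4):
    """Place a file in a subdirectory named after the start of its name.

    Hypothetical helper: the prefix length of 4 is an assumption.
    """
    return str(PurePosixPath(filename[:prefix_len]) / filename)


def flat_name(path):
    """Recover the flat file name from a sharded path."""
    return PurePosixPath(path).name


if __name__ == "__main__":
    p = sharded_path("00000001_0000.png")
    print(p, flat_name(p))
```

Because the mapping is mechanical in both directions, a consumer that wants flat names can derive them from the sharded layout without any extra index.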

So actually, the web front-end is a red herring. I do want database contributors to deliberately arrange their files into a structure that's feasible to navigate (and yes, I'd like to make that easier, but I prioritize the convenience of users over that of authors.) And it's easier to tell people "the files need to be split into subdirectories so that the website isn't overloaded" than it is to explain all the reasons that huge directories are problematic.

(I also have to say, that Google interface is not exactly a paragon of usability. No permalinks, no way to navigate to page 20 other than clicking the arrow button 19 times, waiting 5 seconds each time? No way even to see how many pages there are?)