Closed landreev closed 3 years ago
From #6505:
From @mankoff:
Hello. I was sent here from #4529.
I'm curious why zipping is a requirement for bulk download. It has been a long time since I've admin'd a webserver, but if I recall many servers (e.g. Apache) perform on-the-fly compression for files that they transfer.
I'm imagining a solution where appending /download/ to any dataverse or dataset URL (where this feature is enabled) exposes the files within as a virtual folder structure. The advantage of this is that wget and other default tools (including the GUI "DownThemAll" browser extension, for example) could be deployed against this URL, and would support filename filtering, inclusion, exclusion, etc. This offloads a whole bunch of functionality to the end-user download tool, rather than bloating Dataverse. If you zip, I promise there is or will be a feature request to "let me bulk download but filter on filename". Just some thoughts about how I'd like to see bulk download exposed as an end-user.
From @poikilotherm:
Independent from the pros and cons of ZIP files (like for many small files), I really like the idea proposed above. The two approaches aren't mutually exclusive, either, which makes it even more attractive.
It should be as simple as rendering a very simple HTML page, containing the links to the files. So this still allows for control of direct or indirect access to the data, even using things like secret download tokens.
Obviously the same goal of bulk download could be achieved via some script, too, but using normal system tools like curl and wget is an even lower barrier for scientists/end users than using the API.
From @landreev:
... I actually like the idea; and would be interested in trying to schedule it for a near release. But I'm not sure this can actually replace the download-multiple-files-as-zip functionality, completely. OK, so adding "/download" to the dataset url "exposes the files within as a virtual folder structure" - so, something that looks like your normal Apache directory listing? Again, I like the idea, but not entirely sure about the next sentence:
No waiting for zipping, which could be a long wait if the dataset is 100s of GB. The download starts instantly
Strictly speaking, we are not "waiting for zipping" - we start streaming the zipped bytes right away, as soon as the first buffer becomes available. But I am with you in principle, that trying to compress the content is a waste of cpu cycles in many cases. It's the "download starts instantly" part that I have questions about. I mean, I don't see how it starts instantly, or how it starts at all. That ".../download" call merely exposed the directory structure - i.e. it produced some html with a bunch of links. It's still the client's job to issue the download requests for these links. I'm assuming what you mean is that the user can point something like wget at this virtual directory, and tell it to crawl it. And then the downloads will indeed "start instantly", and wget will handle all the individual downloads and replicate the folder structure locally, etc. I agree that this would save a command line user a whole lot of scripting that's currently needed - first listing the files in the dataset, parsing the output, issuing individual download requests for the files etc. But as for the web users - the fact that it would require a browser extension for a folder download to happen automatically makes me think we're not going to be able to use this as the only multiple file download method. (I would definitely prefer not to have this rely on a ton of custom client-side javascript, for crawling through the folders and issuing download requests either...)
(Or is it now possible to create HTML5 folders, that a user can actually drag-and-drop onto their system - an entire folder at once? Last time I checked it couldn't be done; but even if it can be done, I would expect it not to be universally supported in all browsers/on all OSes...)
My understanding of this is that even if this can be implemented as a viable solution, for both the API and web users, we'll still have to support the zipping mechanism. Of which, honestly, I'm not a big fan either. It's kind of a legacy thing - something we need to support, because that's the format most/many users want. As you said, the actual compression that's performed when we create that ZIP stream is most likely a waste of cycles. With large files specifically - most of those in our own production are either compressed already by the user, or are in a format that uses internal compression. So this means we are using ZIP not for compression - but simply as a format for packaging multiple files in a single archive stream. Which is kind of annoying, true. But I've been told that "users want ZIP".
But please don't get me wrong, I am interested in implementing this, even if it does not replace the ZIP download.
From @mankoff:
Hi - you're right, this does not start the download. I was assuming wget is pointed at that URL, and that starts the downloads.
As for browser-users, sometimes I forget that interaction mode, but you are right, they would still need a way to download multiple files, hence zipping. If zipping is on-the-fly streaming and you don't have to wait for n files n GB in size to all get zipped, then it isn't as painful/bad as I assumed. You are right, exposing a virtual folder for CLI users (or DownThemAll browser extension users) and bulk download are separate issues. Perhaps this virtual folder should be its own Issue here on GitHub to keep things separate.
From @mankoff:
I realize that if appending /download to the URL doesn't start the download as @landreev pointed out, that may not be the best URL. Perhaps /files would be better. In which case, appending /metadata could be a way for computers to fetch the equivalent of the metadata tab that users might click on, here again via a simpler mechanism than the API.
From @landreev:
I realize that if appending /download to the URL doesn't start the download ... that may not be the best URL. Perhaps /files would be better.
I like /files. Or /viewfiles? Something like that.
I also would like to point out that we don't want this option to start the download automatically, even if it were possible. Just like with zipped downloads, either via the API or the GUI, not everybody wants all the files. So we want the command line user to be able to look at the output of this /files call, and, for example, select a subfolder they want - and then tell wget to crawl it. Same with the web user.
From @landreev:
... If zipping is on-the-fly streaming and you don't have to wait for n files n GB in size to all get zipped, then it isn't as painful/bad as I assumed.
But I readily acknowledge that it's still bad and painful, even with streaming. The very fact that we are relying on one long uninterrupted HTTP GET request to potentially download a huge amount of data is "painful". And the "uninterrupted" part is a must - because it cannot be resumed from a specific point if the connection dies (by nature of having to generate the zipped stream on the fly). There are other "bad" things about this process, some we have discussed already (spending CPU cycles compressing = potential waste); and some I haven't even mentioned yet... So yes, being able to offer an alternative would be great.
From @poikilotherm:
Just a side note: one might be tempted to create a WebDAV interface, which could be included in things like Nextcloud.
Related to #7174 - the /files view could expose versions, like this:
├── 1.0
├── 1.1
├── 2.0
├── 2.1
├── 2.2
├── 2.3
├── 2.4
├── 3.0
├── 4.0
├── 5.0
└── latest
More generally, it would be nice to always have access to the latest version of a file, even though the file DOI changes when the file updates. The behavior described here provides that feature. I'm not sure this is correct though, because that means doi:nn.nnnn/path/to/doi/for/v3/files/latest/ or doi:nn.nnnn/path/to/doi/for/v3/files/2.4/ will download versions that are not v3 (the actual DOI used in this example). Could be confusing...
@djbrooke I see you added this to a "Needs Discussion" card. Is there any part of the discussion I can help with?
Another use case that popped up today from a workshop: making such a structure available could help with integrating data in Dataverse with DataLad, https://github.com/datalad/datalad. DataLad is basically a wrapper around git-annex, which allows for special remotes. DataLad is gaining traction especially in communities with big data needs like neuroimaging. Cross-linking datalad/datalad#393 here.
@poikilotherm we're friendly with the DataLad team. In https://chat.dataverse.org the DataLad PI is "yoh" and I've had the privilege of having tacos with him in Boston and beers with him in Brussels ( https://twitter.com/philipdurbin/status/1223987847222431744 ). I really enjoyed the talk they gave at FOSDEM 2020 and you can find a recording here: https://archive.fosdem.org/2020/schedule/event/open_research_datalad/ . Anyway, we're happy to integrate with DataLad in whatever way makes sense (whenever it makes sense).
@mankoff "Needs Discussion" makes more sense if you look at our project board, which goes from left to right: https://github.com/orgs/IQSS/projects/2
Basically, "Needs Discussion" means that the issue is not yet defined well enough to be estimated or to have a developer pick it up. As of this writing it looks like there are 39 of these issues, so you might need to be patient with us. 😄
@mankoff @poikilotherm @pdurbin thanks for the discussion here. I'll prioritize the team discussing this as there seem to be a few use cases that could be supported here.
@scolapasta can you get this on the docket for next tech hours? (Or just discuss with @landreev if it's clear enough.)
Thanks all!
I agree that this should be ready to move into the "Up Next" column. Whatever decisions may still need to be made, we should be able to resolve them as we work on it. The implementation should be straightforward enough. One big-ish question is whether there is already a good package we can use that will render these crawl-able links, or if we should just go ahead and implement it from scratch (since the whole point is to have these simple, straight html links with no fancy UI features, the latter feels like a reasonable idea?).
And I just want to emphasize that this is my understanding of what we want to develop: this is not another UI implementation of a tree view of the files and folders (like we already have on the dataset page, but with download links). This is not for human users (mostly), but for download clients (command line-based or browser extensions) to be able to crawl through the whole thing and download every file; hence this should output a simple html view of one folder at a time, with download links for files and recursive links to sub-folders. Again, similarly to how files and directories on a filesystem look when exposed behind an httpd server.
@mankoff
Related to #7174 - the /files view could expose versions, like this:
├── 1.0
├── 1.1
├── 2.0
├── 2.1
├── 2.2
├── 2.3
├── 2.4
├── 3.0
├── 4.0
├── 5.0
└── latest
Thinking about this - I agree that this API should understand version numbers; maybe serve the latest version by default, or a select version, when specified. But I'm not sure about providing a top level access point for multiple versions at the same time, like in your example above. My problem with that is that if you have a file that happens to be in all 10 versions, a crawler will proceed to download and save 10 different copies of that file, if you point it at the top level pseudo folder.
I'm happy to hear this is moving toward implementation. I agree with your understanding of the features, functions, and purpose of this. This is also what you wrote when you opened the ticket.
I was just going to repeat my 'version' comment when you posted your 2nd comment above. Yes to latest by default, with perhaps some method to access earlier versions (could be different URL you find from the GUI, not necessarily as sub-folders under the default URL for this feature).
Use case: I have a dataset A that updates every 12 days via the API. I am working on another project B that is doing something every day, and I always want the latest A. It would be good if the code in B is 1 line (wget to the latest URL, download if the server version is newer than the local version). It would not be as good if B needed to include a complicated function to access the A dataset, parse something to get the URL for the latest version, and then download that.
Just a quick thought: what about making it WebDAV compatible? It could be integrated into Nextcloud/ownCloud this way (read-only for now).
The most useful minimal implementation is the latest files in a dataset: http://doi/view/latest exposes a simple wget-friendly view of all files and folders. Note that view is open for discussion - could be files or list or download or something else. Versioning would only show the files in that version, so http://doi/view/4.0 might show different files and folders.
Aux and metadata? I guess. I notice when I download a dataset I get MANIFEST.TXT even though I didn't ask for it. I'm not sure what happens if the dataset contains a real file called MANIFEST.TXT. But there could be a virtual folder of aux and metadata too.
I'm not sure what your 3rd point means. But the point of this feature is not for the GUI. It's a way to make bulk download easy and accessible to the most common tools and user experience - "similarly to how files and directories on a filesystem look like when exposed behind an httpd server."
Thanks @mankoff, I think we're all set, I was just capturing some discussion from the sprint planning meeting this afternoon. We'll start working on this soon, and I mentioned that we may run some iterations by you as we build it out.
Thinking about the behaviors requested here after reading and commenting on #7425, I see a problem.
The original request was to allow easy file access for a dataset, so doi:nnnn/files exposes the files for wget or a similar access method.
The request grew to add a logical feature to support easy access to the latest version of the files. "Easy" here presumably means via a fixed URL. But DOIs point to a specific version, so it is counter-intuitive for doi:nnn_v1/latest to point to something that is not v1.
I note that Zenodo provides a DOI that always points to the latest version, with clear links back to the DOIs of earlier versions. Would this behavior be a major architecture change for Dataverse?
Or if you go to doi:nnnn/latest does it automatically redirect to a different doi, unless nnnn is the latest? I'm not sure if this is a reasonable behavior or not.
Anyway, perhaps "easy URL access to folders" and "fixed URL access to latest" should be treated as two distinct features to implement and test, although there is a connection between the two and the latter should make use of the former.
How would a /dataset doi/dataset version or :latest/file path URI work? That would allow a stable URI for the file of a given path/name in the latest dataset version. If files are being replaced by files with different names this wouldn't work, but it would avoid trying to have both the dataset and file versioning schemes represented in the API.
If files are deleted or renamed, then a 404 or similar error seems fine.
Note that this ticket is about exposing files and folders in a simple view, so if you use this feature to link to the latest version of a dataset (not a file within the dataset), then everything "just works", because whatever files exist in the latest version would be in that folder, by definition.
Here are some use-cases: for example, a dataset where a new file YYYY-MM-DD.tif is added every day. How can we easily share this dataset with colleagues (and computers) so they always get the latest data? From your suggestions above, /dataset doi/dataset version won't find the latest, but could expose the files in dataset version in a virtual folder. The URL with :latest/file path won't work because the files for tomorrow don't exist in the 2nd example, where files get added every day. The URL dataset doi/view/latest could expose the latest version in a simple virtual folder, but may confuse people because of the DV vs. Zenodo architecture decision, where dataset doi is not meant to point to the latest version.
Not sure I follow. /10.5072/ABCDEF/:latest/file1.csv would always be the latest file with that name, and /10.5072/ABCDEF/:latest/ would always be a point where you'd get the latest dataset version's list of files, so new files with new dates would appear there. Does that support your two cases OK? (In essence the doi + :latest serves as the ID for the latest version versus there being a separate DOI for that.) The concern I raised was that, because I could replace file1.csv with filen.csv in later versions, a scheme using the file path/name won't expose the relationship between file1.csv and filen.csv, but if the names are the same, or the names include dates that can be used to get the file you want, knowing Dataverse's internal relationship between those files may not be important. (Conversely - what happens if someone deleted file1 in dataverse and added a new file1, versus using the replace function? Should the API not provide a single URI that would get you to the latest?)
Please recall the opening description by @landreev, "similar to how static files and directories are shown on simple web servers." Picture this API allowing browsing a simple folder. This may help answer some of the questions below. If a file is deleted from a folder, it is no longer there. If a file is renamed, or replaced, the latest view should be clearly defined based on our shared common (Mac, Windows, Linux, not VAX or DropBox web view behavior) OS experiences of browsing folders containing files.
Another option that may simplify implementation: the :latest notation is only valid for a dataset, not a file. Recall again that we're talking about two things in this ticket: 1) :latest and 2) :view, providing the virtual folder. If :latest is limited to datasets and not files, then combining it with :view provides access to the files within the latest dataset.
Not sure I follow. /10.5072/ABCDEF/:latest/file1.csv would always be the latest file with that name, and /10.5072/ABCDEF/:latest/ would always be a point where you'd get the latest dataset version's list of files, so new files with new dates would appear there. Does that support your two cases OK? (In essence the doi + :latest serves as the ID for the latest version versus there being a separate DOI for that.)
Yes this works for both use cases.
I still point out that 10.5072/ABCDEF is (in theory) the DOI for v1, so having it also point to latest because of an additional few characters (i.e., :latest) could be confusing. But I think that is a requirement given the architecture decision that there is no minted DOI that always points to the latest (like Zenodo).
Furthermore, if /10.5072/ABCDEF/:latest/ is generalized to support :v1, :v2, etc. in addition to :latest, then any DOI for any version within a dataset can be used to access any other version. For my daily updating data, after a year I have 365 DOIs, each of which can be used to access all 365 versions.
The concern I raised was that, because I could replace file1.csv with filen.csv in later versions, a scheme using the file path/name won't expose the relationship between file1.csv and filen.csv, but if the names are the same, or the names include dates that can be used to get the file you want, knowing Dataverse's internal relationship between those files may not be important.
I personally am not concerned by this. The relationship is still available for people to see in the GUI "Versions" tab.
(Conversely - what happens if someone deleted file1 in dataverse and added a new file1, versus using the replace function? Should the API not provide a single URI that would get you to the latest?)
Hmmm. Ugh :). So I see the following choices:
File is deleted and not in latest version.
File is replaced and in the latest version:
File is deleted, then added, and exists in latest version
[X] API can point to the latest available
[ ] API can return error: ambiguous file
[ ] API for ":latest" can look at the DOI used, and trace it downstream. If the DOI was for the earlier version that got deleted, then return the latest file before deletion. If the DOI was for an intermediate version where it did not exist, return error. If the DOI was for a later version after it was added, trace it downstream and return the latest one.
This seems overly complicated and I'd vote for "just return the latest".
given the architecture decision that there is no minted DOI that always points to the latest (like Zenodo)
@mankoff I'm confused by this. It's actually the opposite. The dataset DOI in Dataverse always points to the latest version of the dataset. If you use the "download all files in a dataset" API ( https://guides.dataverse.org/en/5.1.1/api/dataaccess.html#downloading-all-files-in-a-dataset ) and pass the dataset DOI, you will get the latest files from version 7 or whatever.
You're absolutely right that Zenodo mints DOIs for each version of a dataset and that Dataverse doesn't do this (it's been requested in #4499). But again, in Dataverse the dataset DOI always points to the latest version. In Dataverse, if you want to download files from a specific (possibly older) version, you pass "3.1" or whatever. Please see https://guides.dataverse.org/en/5.1.1/api/dataaccess.html#download-by-dataset-by-version
@pdurbin you are correct. I apologize for adding confusion to this conversation. I was confusing DOIs for datasets with DOIs for files. The file DOIs update.
@mankoff no problem, for downloading the latest version of a file (as you know, but for others) there's a new issue:
And now that I'm not as confused, I'll note that (I think) #7425 is solved if this ticket is implemented. If doi:nnnn/view/ exposes the dataset as a virtual folder, then you can link in there to get a fixed URL for a file as long as it exists in the latest version of the dataset.
Initial thoughts:
- Allow persistentId as a path param, using a regex in JAX-RS to match the param instead of having to place it in a query parameter. This doesn't break the current API spec, just extends it.
- What about restricted files retrieved with curl? Show them in the listing, but download as a text file stating the restricted access and how to gain access?
- This is about downloading with wget, so Data Access API. But it allows for browsing, so more like the JSON file view, so Native API? I would vote for not introducing a new API path but keeping it in line with what we have and staying consistent. Either stay with the Access API or the Native API. Yet we can mix them a bit: implement the view itself in the Native API but let the download links point to the Access API using the data file endpoints.
Let's create a more vivid example. If I would like to browse the files of https://demo.dataverse.org/dataset.xhtml?persistentId=doi:10.70122/FK2/IPZBAU as virtual folders, I would go to:
This could live in edu.harvard.iq.dataverse.api.Access. Dropping the "view" verb would download the files as a ZIP package. Please note that the current Access API has no verbs (with one yet irrelevant exception). The Native API already has endpoints for versions and files in them, but so far using JSON only. We need to be careful around there not to break anything relying on it. We could introduce /api/datasets/{id}/versions/{version}/tree for a folder view (there are already /files and /metadata for JSON).
This endpoint already has support for versions ":latest", ":draft", and ":latest-published" (which should also accept them without the colons).
Let's create a more vivid example. If I would like to browse the files of https://demo.dataverse.org/dataset.xhtml?persistentId=doi:10.70122/FK2/IPZBAU as virtual folders, I would go to:
Subfolders have to be depicted in the URL (so folders get created during the download). Files would be links to https://demo.dataverse.org/api/access/datafile/{fileId} or redirects to the same place if given via the URL.
Again, a vivid example: https://demo.dataverse.org/api/datasets/:persistentId/versions/latest/tree/subfolder/foo/bar/file.txt?persistentId=10.70122/FK2/IPZBAU redirects to https://demo.dataverse.org/api/access/datafile/xyz, triggering the download.
(Imagine using the PID in the URL directly... 🤩 )
@mankoff and anyone else who may be interested, the current implementation in my branch works as follows:
I called the new crawlable file access API "fileaccess": /api/datasets/{dataset}/versions/{version}/fileaccess (So the name/syntax follows the existing API /api/datasets/{dataset}/versions/{version}/files, which shows the metadata for the files in a given version. I'm open to naming it something else; I'm considering "folderview", and maybe the version number should be passed as a query parameter instead).
The optional query parameter ?folder=<foldername> specifies the subfolder to list.
For the {dataset} id both the numeric and :persistentId notations are supported, like in other similar APIs.
The API outputs a simple html listing (I made it to look like the standard Apache directory index), with Access API download links for individual files, and recursive calls to the API above for sub-folders.
I think it's easier to use an example, and pictures:
Let's say we have a dataset version with 2 files, one of them with the folder named "subfolder" specified:
or, as viewed as a tree on the dataset page:
The output of the fileaccess API for the top-level folder (/api/datasets/NNN/versions/MM/fileaccess) will be as follows:
with the underlying html source:
```html
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 3.2 Final//EN">
<html><head><title>Index of folder /</title></head>
<body><h1>Index of folder / in dataset doi:XXX/YY/ZZZZ</h1>
<table>
<tr><th>Name</th><th>Last Modified</th><th>Size</th><th>Description</th></tr>
<tr><th colspan="4"><hr></th></tr>
<tr><td><a href="/api/datasets/NNNN/versions/MM/fileaccess?folder=subfolder">subfolder/</a></td><td align="right"> - </td><td align="right"> - </td><td align="right"> </td></tr>
<tr><td><a href="/api/access/datafile/KKKK">testfile.txt</a></td><td align="right">13-January-2021 22:35</td><td align="right">19 B</td><td align="right"> </td></tr>
</table></body></html>
```
And if you follow the ../fileaccess?folder=subfolder link above it will produce the following view:
with the html source as follows:
```html
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 3.2 Final//EN">
<html><head><title>Index of folder /subfolder</title></head>
<body><h1>Index of folder /subfolder in dataset doi:XXX/YY/ZZZZ</h1>
<table>
<tr><th>Name</th><th>Last Modified</th><th>Size</th><th>Description</th></tr>
<tr><th colspan="4"><hr></th></tr>
<tr><td><a href="/api/access/datafile/subfolder/LLLL">50by1000.tab</a></td><td align="right">11-January-2021 09:31</td><td align="right">102.5 KB</td><td align="right"> </td></tr>
</table></body></html>
```
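To make the crawling behavior concrete, here is a rough, hypothetical sketch (stdlib Python only, not part of Dataverse) that parses listings in the format shown above and separates direct file links from recursive sub-folder links:

```python
from html.parser import HTMLParser

class DirIndexParser(HTMLParser):
    """Collects hrefs from a dirindex-style listing page."""
    def __init__(self):
        super().__init__()
        self.files = []    # direct download links (/api/access/datafile/...)
        self.folders = []  # recursive links back into the listing API

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href") or ""
        if "/api/access/datafile/" in href:
            self.files.append(href)
        else:
            self.folders.append(href)

# Sample listing in the same shape as the html source above
listing = '''
<table>
<tr><td><a href="/api/datasets/NNNN/versions/MM/fileaccess?folder=subfolder">subfolder/</a></td></tr>
<tr><td><a href="/api/access/datafile/KKKK">testfile.txt</a></td></tr>
</table>
'''

parser = DirIndexParser()
parser.feed(listing)
print(parser.files)    # ['/api/access/datafile/KKKK']
print(parser.folders)  # ['/api/datasets/NNNN/versions/MM/fileaccess?folder=subfolder']
```

A real client (wget, DownThemAll, etc.) does essentially this, then follows the folder links recursively and issues a GET request for each file link.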
Note that I'm solving the problem of having wget --recursive preserve the folder structure when saving files by embedding the folder name in the file access API URL: /api/access/datafile/subfolder/LLLL, instead of the normal /api/access/datafile/LLLL notation.
Yes, this is perfectly legal! You can embed an arbitrary number of slashes into a path parameter, by using a regex in the @Path notation:
@Path("datafile/{fileId:.+}")
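As a rough analogy (plain Python re, not actual JAX-RS code), the {fileId:.+} segment is a greedy match, so embedded slashes are simply swallowed into the one parameter:

```python
import re

# Hypothetical analogue of the JAX-RS mapping @Path("datafile/{fileId:.+}"):
# ".+" is greedy, so slashes become part of the captured fileId parameter.
route = re.compile(r"^datafile/(?P<fileId>.+)$")

m = route.match("datafile/subfolder/LLLL")
print(m.group("fileId"))  # subfolder/LLLL
```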
The wget command line for crawling this API is NOT pretty, but it's what I've come up with so far that actually works:
```shell
wget --recursive -nH --cut-dirs=3 --content-disposition http://localhost:8080/api/datasets/NNNN/versions/1.0/fileaccess
```
Any feedback - comments/suggestions - are welcome.
This looks good at first pass. I did not know of the --content-disposition flag for wget. What happens if that is left off? Are the files not named correctly? The rest of the wget command looks about as normal as most times that I use it...
One concern is the version part of the URL. Is there a way to easily always get the latest version? Either if version is left off the URL, or if it is set to latest rather than a number?
I did not know of the --content-disposition flag for wget. What happens if that is left off? Are the files not named correctly? The rest of the wget command looks about as normal as most times that I use it...
Correct, without the "--content-disposition" flag wget will download http://host/api/access/datafile/1234 and save it as 1234. With this flag wget will use the real filename that we supply in the "Content-Disposition:" header. (Browsers do this automatically, so this header is the reason a browser offers to save a file downloaded from our dataset page under its user-friendly name.)
It is, unfortunately, impossible to use that header to supply a folder name as well. If you try something like Content-disposition: attachment; filename="folder/subfolder/testfile.txt" the "folder/subfolder" part is ignored, and the file is still saved as "testfile.txt".
So I rely on both this header, and embedding the folder name into the access url, and --cut-dirs=3, to download /api/access/datafile/folder/subfolder/1234 and have it saved as folder/subfolder/testfile.txt.
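A rough model of that combination (my sketch, not wget's actual algorithm): the directory part comes from the URL with the host dropped (-nH) and the first three path components cut (--cut-dirs=3), while the filename comes from the Content-Disposition header:

```python
import posixpath
from urllib.parse import urlsplit

def local_save_path(url, content_disposition, cut_dirs=3):
    """Rough model of wget -nH --cut-dirs=N --content-disposition:
    directories come from the URL (host dropped, first N components cut),
    the filename comes from the Content-Disposition header."""
    parts = urlsplit(url).path.strip("/").split("/")
    dirs = parts[:-1][cut_dirs:]  # drop the URL's last segment, then cut N dirs
    # naive header parse; real wget handles quoting and encoding more carefully
    filename = content_disposition.split("filename=")[1].strip('"')
    return posixpath.join(*dirs, filename)

print(local_save_path(
    "http://localhost:8080/api/access/datafile/folder/subfolder/1234",
    'attachment; filename="testfile.txt"'))
# folder/subfolder/testfile.txt
```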
One concern is the version part of the URL. Is there a way to easily always get the latest version? Either if version is left off the URL, or if it is set to latest rather than a number?
It understands our standard version id notations like :draft, :latest and :latest-published.
But yes, I am indeed considering dropping the version from the path. So it would be /api/datasets/{datasetid}/fileaccess, defaulting to the latest version available, with the optional ?version={version} query parameter for requesting a different version.
@mankoff Hi, a quick followup to the comments above: I ended up dropping the version parameter from the path. I also renamed the API. It is now called "dirindex" - to emphasize that it presents the dataset in a way that resembles the Apache Directory Index format.
So the API path is now /api/datasets/{dataset}/dirindex and it defaults to the latest version. An optional parameter ?version={version} can be used to specify a different version.
This is all documented in the API guide as part of the pull request linked above.
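For illustration only, a hypothetical helper (not part of Dataverse; it just composes the URL shapes described above) that builds dirindex request URLs for both the numeric id and the :persistentId notation:

```python
from urllib.parse import urlencode

BASE = "https://demo.dataverse.org"  # any Dataverse 5.4+ installation

def dirindex_url(dataset_id=None, persistent_id=None, version=None):
    """Build a dirindex request URL for a numeric database id or a DOI
    (via the :persistentId notation); version is optional and defaults
    to the latest version on the server side."""
    params = {}
    if persistent_id:
        path = f"{BASE}/api/datasets/:persistentId/dirindex"
        params["persistentId"] = persistent_id
    else:
        path = f"{BASE}/api/datasets/{dataset_id}/dirindex"
    if version:
        params["version"] = version
    return f"{path}?{urlencode(params)}" if params else path

print(dirindex_url(dataset_id=24))
# https://demo.dataverse.org/api/datasets/24/dirindex
print(dirindex_url(persistent_id="doi:10.70122/FK2/MV0TMN", version="1.0"))
```

Pointing wget --recursive at either URL then crawls the whole tree, as in the examples above.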
Hello. If my institution upgrades their Dataverse, will we receive this feature? Or is it implemented into some future release and is not included in the latest version installed when updating?
Hi @mankoff, this will be included in the next release, 5.4. I added the 5.4 tag to the PR:
https://github.com/IQSS/dataverse/pull/7579
Once 5.4 shows up in https://github.com/IQSS/dataverse/releases you'll be able to install and use the release with this feature. We expect this in the next few weeks - we're just waiting on a few more issues to finish up.
Hello. I see that demo.dataverse.org is now at v5.4, so I'd like to test this.
I'm reading the docs here https://guides.dataverse.org/en/latest/api/native-api.html?highlight=dirindex#view-dataset-files-and-folders-as-a-directory-index
And it seems to only work with the dataset ID. If I'm an end-user, how do I find the ID? Is there a way to browse the dirindex using the DOI? Can you provide an example with this demo data set? https://demo.dataverse.org/dataset.xhtml?persistentId=doi:10.70122/FK2/MV0TMN
Also, regarding point #5 from https://github.com/IQSS/dataverse/issues/7084#issuecomment-731053846 this API does not allow browsing. When I go to https://demo.dataverse.org/api/datasets/24/dirindex I'm given a ".index" to download in firefox, not something that I can view in my browser. This also means (I think?) that browser-tools that I hoped would use this feature, like DownThemAll probably won't work.
Also seeing a ".index.html" download with https://demo.dataverse.org/api/datasets/:persistentId/dirindex/?persistentId=doi:10.70122/FK2/HXJVJU or https://demo.dataverse.org/api/datasets/:persistentId/dirindex/?persistentId=doi:10.70122/FK2/PDRSIQ
The file contains the expected HTML page.
Well that tells me how to use this with DOI and not ID. I suggest making this option clear in the API docs. I'll add an issue for that.
This is based on a suggestion from a user (@mankoff) made earlier in the "optimize zip" issue (#6505). I believe something similar had also been proposed elsewhere earlier. I'm going to copy the relevant discussion from that issue and add it here.
I do not consider this a possible replacement for the "download multiple files as zip" functionality. Unfortunately, we're pretty much stuck supporting zip, since it has become the de-facto standard for sharing multi-file and folder bundles. But it could be something very useful to offer as another option.
The way it would work, there will be an API call (for example, /api/access/dataset/<id>/files) that would expose the files and folders in the dataset as a crawl-able tree of links; similar to how static files and directories are shown on simple web servers. A command line user could point a client - for example, wget - to crawl and save the entire tree, or a sub-folder thereof. The advantages of this method are huge - the end result is the same as downloading the entire dataset as Zip and unpacking the archive locally, in one step. But it's achieved in a dramatically better way - by wget issuing individual GET calls for the individual files; meaning that those a) can be redirected to S3 and b) the whole process is completely resume-able in case it is interrupted; unlike the single continuous zip download that cannot be resumed at all. The advantages are not as dramatic for the web UI users. None of the browsers I know of support drag-and-drop downloads of entire folders out of the box. However, plugins that do that are available for major browsers. Still, even clicking through the folders, and being able to download the files directly (unlike in the current "tree view" on the page) would be pretty awesome. Again, see the discussion re-posted below for more information. I would strongly support implementing this sometime soon (soon after v5.0, that is).