Implement access to the files in the dataset as a virtual folder tree

landreev commented 4 years ago

This is based on a suggestion from a user (@mankoff) made earlier in the "optimize zip" issue (#6505). I believe something similar had also been proposed elsewhere earlier. I'm going to copy the relevant discussion from that issue and add it here.

I do not consider this as a possible replacement for the "download multiple files as zip" functionality. Unfortunately, we're pretty much stuck supporting zip, since it has become the de-facto standard for sharing mutli-file and folders bundles. But it could be something very useful to offer as another option.

The way it would work, there will be an API call (for example, /api/access/dataset/<id>/files) that would expose the files and folders in the dataset as crawl-able tree of links; similar to how static files and directories are shown on simple web servers. A command line user could point a client - for example, wget - to crawl and save the entire tree, or a sub-folder thereof. The advantages of this method are huge - the end result is the same as downloading the entire dataset as Zip and unpacking the archive locally, in one step. But it's achieved in a dramatically better way - by wget issuing individual GET calls for the individual files; meaning that those a) can be redirected to S3 and b) the whole process is completely resume-able in case it is interrupted; unlike the single continuous zip download that cannot be resumed at all. The advantages are not as dramatic for the web UI users. None of the browsers I know of support drag-and drop downloads of entire folders out of the box. However, plugins that do that are available for major browsers. Still, even clicking through the folders, and being able to download the files directly (unlike in the current "tree view" on the page) would be pretty awesome. Again, see the discussion re-posted below for more information.

I would strongly support implementing this sometime soon (soon after v5.0 that is).

landreev commented 4 years ago

From 6505:

From @mankoff:

Hello. I was sent here from #4529.

I'm curious why zipping is a requirement for bulk download. It has been a long time since I've admin'd a webserver, but if I recall many servers (e.g. Apache) perform on-the-fly compression for files that they transfer.

I'm imagining a solution where appending /download/ to any dataverse or dataset URL (where this feature is enabled) exposes the files within as a virtual folder structure. The advantages of this are:

No waiting for zipping, which could be a long wait if the dataset is 100s of GB. The download starts instantly
No zipping of files that cannot be compressed. For example, NetCDF files with internal compression. I believe Apache on-the-fly compression can be configured per filetype (MIME or extension), so some files would still be transferred as compressed, but not all
wget and other default tools (including GUI "DownThemAll" browser extension, for example) could be deployed against this URL, and would support filename filtering, inclusion, exclusion, etc. This offloads a whole bunch of functionality to the end-user download tool, rather than bloating Dataverse. If you zip, I promise there is or will be a feature request to "let me bulk download but filter on filename".

Just some thoughts about how I'd like to see bulk download exposed as an end-user.

From @poikilotherm:

Independent from the pros and cons of ZIP files (like for many small files), I really like the idea proposed above. Both approaches don't merely exclude each other, too, which makes it even more attractive.

It should be as simple as rendering a very simple HTML page, containing the links to the files. So this still allows for control of direct or indirect access to the data, even using things like secret download tokens.

Obviously the same goal of bulk download could be done via some script, too, but using normal system tools like curl and wget is even a lower barrier for scientist/endusers than using the API.

From @landreev:

... I actually like the idea; and would be interested in trying to schedule it for a near release. But I'm not sure this can actually replace the download-multiple-files-as-zip functionality, completely. OK, so adding "/download" to the dataset url "exposes the files within as a virtual folder structure" - so, something that looks like your normal Apache directory listing? Again, I like the idea, but not entirely sure about the next sentence:

No waiting for zipping, which could be a long wait if the dataset is 100s of GB. The download starts instantly

Strictly speaking, we are not "waiting for zipping" - we start streaming the zipped bytes right away, as soon as the first buffer becomes available. But I am with you in principle, that trying to compress the content is a waste of cpu cycles in many cases. It's the "download starts instantly" part that I have questions about. I mean, I don't see how it starts instantly, or how it starts at all. That ".../download" call merely exposed the directory structure - i.e. it produced some html with a bunch of links. It's still the client's job to issue the download requests for these links. I'm assuming what you mean is that the user can point something like wget at this virtual directory, and tell it to crawl it. And then the downloads will indeed "start instantly", and wget will handle all the individual downloads and replicate the folder structure locally, etc. I agree that this would save a command line user a whole lot of scripting that's currently needed - first listing the files in the dataset, parsing the output, issuing individual download requests for the files etc. But as for the web users - the fact that it would require a browser extension for a folder download to happen automatically makes me think we're not going to be able to use this as the only multiple file download method. (I would definitely prefer not to have this rely on a ton of custom client-side javascript, for crawling through the folders and issuing download requests either...)

(Or is it now possible to create HTML5 folders, that a user can actually drag-and-drop onto their system - an entire folder at once? Last time I checked it couldn't be done; but even if it can be done, I would expect it not to be universally supported in all browsers/on all OSes...)

My understanding of this is that even if this can be implemented as a viable solution, for both the API and web users, we'll still have to support the zipping mechanism. Of which, honestly, I'm not a big fan either. It's kind of a legacy thing - something we need to support, because that's the format most/many users want. As you said, the actual compression that's performed when we create that ZIP stream is most likely a waste of cycles. With large files specifically - most of those in our own production are either compressed already by the user, or are in a format that uses internal compression. So this means we are using ZIP not for compression - but simply as a format for packaging multiple files in a single archive stream. Which is kind of annoying, true. But I've been told that "users want ZIP".

But please don't get me wrong, I am interested in implementing this, even if it does not replace the ZIP download.

From @mankoff:

Hi - you're right, this does not start the download. I was assuming wget is pointed at that URL, and that starts the downloads.

As for browser-users, sometimes I forget that interaction mode, but you are right, they would still need a way to download multiple files, hence zipping. If zipping is on-the-fly streaming and you don't have to wait for n files n GB in size to all get zipped, then it isn't as painful/bad as I assumed. You are right, exposing a virtual folder for CLI users (or DownThemAll browser extension users) and bulk download are separate issues. Perhaps this virtual folder should be its own Issue here on GitHub to keep things separate.

From @mankoff:

I realize that if appending /download to the URL doesn't start the download as @landreev pointed out, that may not be the best URL. Perhaps /files would be better. In which case, appending /metadata could be a way for computers to fetch the equivalent of the metadata tab that users might click on, here again via a simpler mechanism than the API.

From @landreev:

I realize that if appending /download to the URL doesn't start the download ... that may not be the best URL. Perhaps /files would be better.

I like /files. Or /viewfiles? - something like that. I also would like to point out that we don't want this option to start the download automatically, even if it were possible. Just like with zipped downloads, either via the API or the GUI, not everybody wants all the files. So we want the command line user to be able to look at the output of this /files call, and, for example, select a subfolder they want - and then tell wget to crawl it. Same with the web user.

From @landreev:

... If zipping is on-the-fly streaming and you don't have to wait for n files n GB in size to all get zipped, then it isn't as painful/bad as I assumed.

But I readily acknowledge that it's still bad and painful, even with streaming. The very fact that we are relying on one long uninterrupted HTTP GET request to potentially download a huge amount of data is "painful". And the "uninterrupted" part is a must - because it cannot be resumed from a specific point if the connection dies (by nature of having to generate the zipped stream on the fly). There are other "bad" things about this process, some we have discussed already (spending CPU cycles compressing = potential waste); and some I haven't even mentioned yet... So yes, being able to offer an alternative would be great.

From @poikilotherm:

Just a side note: one might be tempted to create a WebDAV interface, which could be included in things like Nextcloud.

mankoff commented 4 years ago

Related to #7174 - the /files view could expose versions, like this:

├── 1.0
├── 1.1
├── 2.0
├── 2.1
├── 2.2
├── 2.3
├── 2.4
├── 3.0
├── 4.0
├── 5.0
└── latest

mankoff commented 4 years ago

More generally, it would be nice to always have access to the latest version of a file, even though the file DOI changes when the file updates. The behavior described here provides that feature. I'm not sure this is correct though, because that means doi:nn.nnnn/path/to/doi/for/v3/files/latest/, or doi:nn.nnnn/path/to/doi/for/v3/files/2.4/ will download versions that are not v3 (the actual DOI used in this example). Could be confusing...

mankoff commented 3 years ago

@djbrooke I see you added this to a "Needs Discussion" card. Is there any part of the discussion I can help with?

poikilotherm commented 3 years ago

Another use case that popped up today from a workshop: making such a structure available could help with integrating data in Dataverse with DataLad, https://github.com/datalad/datalad. DataLad is basically a wrapper around git-annex, which allows for special remotes

DataLad is gaining traction especially in communities with big data needs like neuroimaging. Corss-Linking datalad/datalad#393 here.

pdurbin commented 3 years ago

@poikilotherm we're friendly with the DataLad team. In https://chat.dataverse.org the DataLad PI is "yoh" and I've had the privilege of having tacos with him in Boston and beers with him in Brussels ( https://twitter.com/philipdurbin/status/1223987847222431744 ). I really enjoyed the talk they gave at FOSDEM 2020 and you can find a recording here: https://archive.fosdem.org/2020/schedule/event/open_research_datalad/ . Anyway, we're happy to integrate with DataLad in whatever way makes sense (whenever it makes sense).

@mankoff "Needs Discussion" makes more sense if you look at our project board, which goes from left to right: https://github.com/orgs/IQSS/projects/2

Community Dev
Needs Discussion
Up Next
IQSS Team - In Progress
Review
QA
Done

Basically, "Needs Discussion" means that the issue is not yet defined well enough to be estimated or to have a developer pick it up. As of this writing it looks like there are 39 of these issues, so you might need to be patient with us. 😄

djbrooke commented 3 years ago

@mankoff @poikilotherm @pdurbin thanks for the discussion here. I'll prioritize the team discussing this as there seem to be a few use cases that could be supported here.

@scolapasta can you get this on the docket for next tech hours? (Or just discuss with @landreev if it's clear enough.)

Thanks all!

landreev commented 3 years ago

I agree that this should be ready to move into the "Up Next" column. Whatever decisions may still need to be made, we should be able to resolve as we work on it. The implementation should be straightforward enough. One big-ish question is whether there is already a good package that will render these crawl-able links that we can use; or if we should just go ahead and implement it from scratch. (since the whole point is to have these simple, straight html links w/no fancy ui features, the latter feels like a reasonable idea -?).

And I just want to emphasize that this is my understanding of what we want to develop: this is not another UI implementation of a tree view of the files and folders (like we already have on the dataset page, but with download links). This is not for human users (mostly), but for download clients (command line-based or browser extensions) to be able to crawl through the whole thing and download every file; hence this should output a simple html view of one folder at a time, with download links for files and recursive links to sub-folders. Again, similarly to how files and directories on a filesystem look like when exposed behind an httpd server.

landreev commented 3 years ago

@mankoff

Related to #7174 - the /files view could expose versions, like this:

├── 1.0
├── 1.1
├── 2.0
├── 2.1
├── 2.2
├── 2.3
├── 2.4
├── 3.0
├── 4.0
├── 5.0
└── latest

Thinking about this - I agree that this API should understand version numbers; maybe serve the latest version by default, or a select version, when specified. But I'm not sure about providing a top level access point for multiple versions at the same time, like in your example above. My problem with that is that if you have a file that happens to be in all 10 versions, a crawler will proceed to download and save 10 different copies of that file, if you point it at the top level pseudo folder.

mankoff commented 3 years ago

I'm happy to hear this is moving toward implementation. I agree with your understanding of the features, functions, and purpose of this. This is also what you wrote when you open the ticket.

I was just going to repeat my 'version' comment when you posted your 2nd comment above. Yes to latest by default, with perhaps some method to access earlier versions (could be different URL you find from the GUI, not necessarily as sub-folders under the default URL for this feature).

Use case: I have a dataset A that updates every 12 days via the API. I am working on another project B that is doing something every day, ad I always wants the latest A. It would be good if the code in B is 1 line (wget to the latest URL, download if server version newer than local version). It would not be as good if B needed to include a complicated function to access the A dataset, parse something to get the URL for the latest version, and then download that.

poikilotherm commented 3 years ago

Just a quick thought: what about making it WebDav compatible? It could be integrated into Next loud/owncloud this way (read-only for now)

djbrooke commented 3 years ago

Are these files or datasets that we're showing here (versioning may be an issue - there may be some files that are not available in all versions). One proposal is for this to work for a specific version.
Should this tree cover aux files and metadata files? It would be good to have a canonical URI for this and for Bag files. Consider a binary switch, similar to how we handle "download all" on the dataset page. We should name/structure the API with this in mind
Not a GUI, but an API that provides this information - file/folder layout of resources - Rest API that provides HTML (ex. wget or other crawler)

mankoff commented 3 years ago

The most useful minimal implementation is the latest files in a dataset: http://doi/view/latest exposes a simple wget-friendly view of all files and folders. Note that view is open for discussion - could be files or list or download or something else. Versioning would only show the files in that version, so http://doi/view/4.0 might show different files and folders.
Aux and metadata? I guess. I notice when I download a dataset I get MANIFEST.TXT even though I didn't ask for it. I'm not sure what happens if the dataset contains a real file called MANIFEST.TXT. But there could be a virtual folder of aux and metadata too.
I'm not sure what your 3rd point means. But the point of this feature is not for the GUI. It's a way to make bulk download easy and accessible to the most common tools and user experience - "similarly to how files and directories on a filesystem look like when exposed behind an httpd server."

djbrooke commented 3 years ago

Thanks @mankoff, I think we're all set, I was just capturing some discussion from the sprint planning meeting this afternoon. We'll start working on this soon, and I mentioned that we may run some iterations by you as we build it out.

mankoff commented 3 years ago

Thinking about the behaviors requested here after reading and commenting on #7425, I see a problem.

The original request was to allow easy file access for a dataset, so doi:nnnn/files exposes the files for wget or a similar access method.

The request grew to add a logical feature to support easy access the latest version of the files. "easy" here presumably means via a fixed URL. But DOIs point to a specific version, so it is counter-intuitive for doi:nnn_v1/latest to point to something that is not v1.

I note that Zenodo provides a DOI that always points to the latest version, with clear links back to earlier DOId versions. Would this behavior be a major architecture change for Dataverse?

Or if you go to doi:nnnn/latest does it automatically redirect to a different doi, unless nnnn is the latest? I'm not sure if this is a reasonable behavior or not.

Anyway, perhaps "easy URL access to folders" and "fixed URL access to latest" should be treated as two distinct features to implement and test, although there is a connection between the two and latter should make use of the former.

qqmyers commented 3 years ago

How would a/dataset doi/dataset version or :latest/file path URI work? That would allow a stable URI for the file of a given path/name in the latest dataset version. If files are being replaced by files with different names this wouldn't work, but it would avoid trying to have both the dataset and file versioning schemes represented in the API.

mankoff commented 3 years ago

If files are deleted or renamed, then a 404 or similar error seems fine.

Note that this ticket is about exposing files and folders in a simple view, so if you use this feature to link to the latest version of a dataset (not a file within the dataset), then everything "just works", because whatever files exist in the latest version would be in that folder, by definition.

Here are some use-cases:

A dataset with files with fixed names that are updated every day (e.g 10 updated CSV and NetCDF files).
A dataset with a new file YYYY-MM-DD.tif added every day

How can we easily share this dataset with colleagues (and computers) so they always get the latest data? From your suggestions above, /dataset doi/dataset version won't find the latest, but could expose the files in dataset version in a virtual folder. The URL with :latest/file path won't work because the files for tomorrow don't exist in the 2nd example, where files get added every day. The URL dataset doi/view/latest could expose the latest version in a simple virtual folder, but may confuse people because of the DV vs. Zenodo architecture decision, where dataset doi is not meant to point to the latest version.

qqmyers commented 3 years ago

Not sure I follow. /10.5072/ABCDEF/:latest/file1.csv would always be the latest file with that name, and /10.5072/ABCDEF/:latest/ would always be a point where you'd get the latest dataset version's list of files, so new files with new dats would appear there. Does that support your two cases OK? (In essence the doi + :latest serves as the ID for the latest version versus there being a separate DOI for that.) The concern I raised was that, because I could replace file1.csv with filen.csv in later versions, a scheme using the file path/name won't expose the relationship between file1.csv and filen.csv, but if the names are the same, or the names include dates that can be used to get the file you want, knowing Dataverse's internal relationship between those files may not be important. (Conversely - what happens if someone deleted file1 in dataverse and added a new file1, versus using the replace function? Should the API not provide a single URI that would get you to the latest?)

mankoff commented 3 years ago

Please recall the opening description by @landreev, "similar to how static files and directories are shown on simple web servers." Picture this API allowing browsing a simple folder. This may help answer some of the questions below. If a file is deleted from a folder, it is no longer there. If a file is renamed, or replaced, the latest view should be clearly defined based on our shared common (Mac, Windows, Linux, not VAX or DropBox web view behavior) OS experiences of browsing folders containing files.

Another option that may simplify implementation: The :latest is only valid for a dataset, not a file. Recall again that we're talking about two things in this ticket: 1) :latest and to :view, providing the virtual folder. If :latest is limited to datasets and not files, then combining it with :view provides access to the files within the latest dataset.

Not sure I follow. /10.5072/ABCDEF/:latest/file1.csv would always be the latest file with that name, and /10.5072/ABCDEF/:latest/ would always be a point where you'd get the latest dataset version's list of files, so new files with new dats would appear there. Does that support your two cases OK? (In essence the doi + :latest serves as the ID for the latest version versus there being a separate DOI for that.)

Yes this works for both use cases.

I still point out that 10.5072/ABCDEF is (in theory) the DOI for v1, so having it also point to latest because of an additional few characters (i.e., :latest), could be confusing. But I think that is a requirement given the architecture decision that there is no minted DOI that always points to the latest (like Zenodo).

Furthermore if /10.5072/ABCDEF/:latest/ is generalized to support :v1, :v2, etc. in addition to :latest, then any DOI for any version within a dataset can be used to access any other version. For my daily updating data, after a year I have 365 DOIs, each of which can be used to access all 365 versions.

The concern I raised was that, because I could replace file1.csv with filen.csv in later versions, a scheme using the file path/name won't expose the relationship between file1.csv and filen.csv, but if the names are the same, or the names include dates that can be used to get the file you want, knowing Dataverse's internal relationship between those files may not be important.

I personally am not concerned by this. The relationship is still available for people to see in the GUI "Versions" tab.

(Conversely - what happens if someone deleted file1 in dataverse and added a new file1, versus using the replace function? Should the API not provide a single URI that would get you to the latest?)

Hmmm. Ugh :). So I see the following choices:

File is deleted and not in latest version.

[ ] API for ":latest" can point to the latest available
[X] API for ":latest" can return error

File is replaced and in the latest version:

[X] API points to latest

File is deleted, then added, and exists in latest version

[X] API can point to the latest available
[ ] API can return error: ambiguous file
[ ] API for ":latest" can look at the DOI used, and trace it downstream. If the DOI was for the earlier version that got deleted, then return the latest file before deletion. If the DOI was for an intermediate version where it did not exist, return error. If the DOI was for a later version after it was added, trace it downstream and return the latest one.

This seems overly complicated and I'd vote for "just return the latest".

pdurbin commented 3 years ago

given the architecture decision that there is no minted DOI that always points to the latest (like Zenodo)

@mankoff I'm confused by this. It's actually the opposite. The dataset DOI in Dataverse always points to the latest version of the dataset. If you use the "download all files in a dataset" API ( https://guides.dataverse.org/en/5.1.1/api/dataaccess.html#downloading-all-files-in-a-dataset ) and pass the dataset DOI, you will get the latest files from version 7 or whatever.

You're absolutely right that Zenodo mints DOIs for each version of a dataset and that Dataverse doesn't to this (it's been requested in #4499). But again, in Dataverse the dataset DOI always points to the latest version. In Dataverse, if you want to download files from a specific (possibly older) version, you pass "3.1" or whatever. Please see https://guides.dataverse.org/en/5.1.1/api/dataaccess.html#download-by-dataset-by-version

mankoff commented 3 years ago

@pdurbin you are correct. I apologize for adding confusion to this conversation. I was confusing DOIs for datasets with DOIs for files. The file DOIs update.

pdurbin commented 3 years ago

@mankoff no problem, for downloading the latest version of a file (as you know, but for others) there's a new issue:

Add API support for downloading the most recent file #7425

mankoff commented 3 years ago

And now that I'm not as confused, I'll note that (I think) #7425 is solved if this ticket is implemented. If doi:nnnn/view/ exposes the dataset as a virtual folder, then you can link in there to get a fixed URL for a file as long as it exists in the latest version of the dataset.

poikilotherm commented 3 years ago

Initial thoughts:

We cannot expect pathes like http://doi.org/10.70122/FK2/IPZBAU/foo/bar/xxx nor queries like http://doi.org/10.70122/FK2/IPZBAU?test=test to redirect to Dataverse and keep the path/query. It will just not resolve (go try).
Can we please stop using ":" in front of the version? It's an unnecessary character, needs escaping and can be missed easily.
Maybe we can make API URLs easier to write and read by enabling the persistentId as a path param using a regex in JAX-RS to match the param instead of having to place it in a query parameter. This doesn't break current API spec, just extends it.
How do we handle restricted files the user has no access to? Do not display? Use Javascript to show them in a browser but hide from curl? Show but download as text file stating the restricted access and how to gain access?
In what API section does this belong? It's for downloading via wget, so Data Access API. But it allows for browsing, so more like the JSON file view, so Native API?

I would vote for not introducing a new API path but keep it in line with what we have and stay consistent. Either stay with Access API or Native API. Yet we can mix them a bit: implement the view itself in the Native API but let the download links point to the Access API using the data file endpoints.

Using Access API only

Let's create a more vivid example. If I would like to browse the files of https://demo.dataverse.org/dataset.xhtml?persistentId=doi:10.70122/FK2/IPZBAU as virtual folders, I would go to:

https://demo.dataverse.org/api/access/dataset/:persistentId/versions/1.0/view?persistentId=10.70122/FK2/IPZBAU
https://demo.dataverse.org/api/access/dataset/:persistentId/versions/latest/view?persistentId=10.70122/FK2/IPZBAU redirecting me to https://demo.dataverse.org/api/access/dataset/:persistentId/versions/3.0/view?persistentId=10.70122/FK2/IPZBAU
https://demo.dataverse.org/api/access/dataset/:persistentId/versions/draft/view?persistentId=10.70122/FK2/IPZBAU would allow authenticated access to files not yet published.

This would be inline with the current Data Access API used for ZIP downloads, see edu.harvard.iq.dataverse.api.Access. Dropping the "view" verb will download the files as a ZIP package. Please note that the current Access API has no verbs (with one yet irrelevant exception).
If we go for moving the verb between "id" and "/versions", IMHO we should think about creating a v2 API endpoint, adding a verb for zipped download, too.

Using Native API for viewing + Access API for downloading

Native API already has endpoints for versions and files in them, but so far using JSON only. We need to be careful around there not to break anything relying on it. We could introduce /api/datasets/{id}/versions/{version}/tree for a folder view (there are already /files and /metadata for JSON).

This endpoint already has support for versions ":latest", ":draft", and ":latest-published" (which should also accept them without the colons).

Let's create a more vivid example. If I would like to browse the files of https://demo.dataverse.org/dataset.xhtml?persistentId=doi:10.70122/FK2/IPZBAU as virtual folders, I would go to:

https://demo.dataverse.org/api/datasets/:persistentId/versions/1.0/tree?persistentId=10.70122/FK2/IPZBAU
https://demo.dataverse.org/api/datasets/:persistentId/versions/latest/tree?persistentId=10.70122/FK2/IPZBAU redirecting me to https://demo.dataverse.org/api/datasets/:persistentId/versions/3.0/tree?persistentId=10.70122/FK2/IPZBAU
https://demo.dataverse.org/api/datasets/:persistentId/versions/draft/tree?persistentId=10.70122/FK2/IPZBAU would allow authenticated access to files not yet published.

Subfolders have to be depicted in the URL (so folders get created during the download). Files would be links to https://demo.dataverse.org/api/access/datafile/{fileId} or redirects to the same place if given via the URL.

Again, a vivid example: https://demo.dataverse.org/api/datasets/:persistentId/versions/latest/tree/subfolder/foo/bar/file.txt?persistentId=10.70122/FK2/IPZBAU redirects to https://demo.dataverse.org/api/access/datafile/xyz, triggering the download.

(Imagine using the PID in the URL directly... 🤩 )

landreev commented 3 years ago

@mankoff and anyone else who may be interested, the current implementation in my branch works as follows:

I called the new crawlable file access API "fileaccess": /api/datasets/{dataset}/versions/{version}/fileaccess (So the name/syntax follows the existing API /api/datasets/{dataset}/versions/{version}/files, that shows the metadata for the files in a given version. I'm open to naming it something else; I'm considering "folderview", and maybe the version number should be passed as a query parameter instead). The optional query parameter ?folder=<foldername> specifies the subfolder to list. For the {dataset} id both the numeric and:persistentId notation are supported, like in other similar APIs.

The API outputs a simple html listing (I made it to look like the standard Apache directory index), with Access API download links for individual files, and recursive calls to the API above for sub-folders.

I think it's easier to use an example, and pictures:

Let's say we have a dataset version with 2 files, one of them with the folder named "subfolder" specified:

dataset_page_files_view

or, as viewed as a tree on the dataset page: dataset_page_tree_view

The output of the fileaccess API for the top-level folder (/api/datasets/NNN/versions/MM/fileaccess) will be as follows:

index_view_top

with the underlying html source:

    <!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 3.2 Final//EN">
    <html><head><title>Index of folder /</title></head>
    <body><h1>Index of folder / in dataset doi:XXX/YY/ZZZZ</h1>
    <table>
    <tr><th>Name</th><th>Last Modified</th><th>Size</th><th>Description</th></tr>
    <tr><th colspan="4"><hr></th></tr>
    <tr><td><a href="/api/datasets/NNNN/versions/MM/fileaccess?folder=subfolder">subfolder/</a></td><td align="right"> - </td><td align="right"> - </td><td align="right">&nbsp;</td></tr>
    <tr><td><a href="/api/access/datafile/KKKK">testfile.txt</a></td><td align="right">13-January-2021 22:35</td><td align="right">19 B</td><td align="right">&nbsp;</td></tr>
    </table></body></html>

And if you follow the ../fileaccess?folder=subfolder link above it will produce the following view:

index_view_subfolder

with the html source as follows:

    <!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 3.2 Final//EN">
    <html><head><title>Index of folder /subfolder</title></head>
    <body><h1>Index of folder /subfolder in dataset doi:XXX/YY/ZZZZ</h1>
    <table>
    <tr><th>Name</th><th>Last Modified</th><th>Size</th><th>Description</th></tr>
    <tr><th colspan="4"><hr></th></tr>
    <tr><td><a href="/api/access/datafile/subfolder/LLLL">50by1000.tab</a></td><td align="right">11-January-2021 09:31</td><td align="right">102.5 KB</td><td align="right">&nbsp;</td></tr>
    </table></body></html>

Note that I'm solving the problem of having wget --recursive preserve the folder structure when saving files by embedding the folder name in the file access API URL: /api/access/datafile/subfolder/LLLL, instead of the normal /api/access/datafile/LLLL notation. Yes, this is perfectly legal! You can embed an arbitrary number of slashes into a path parameter, by using a regex in the @Path notation:

@Path("datafile/{fileId:.+}")

The wget command line for crawling this API is NOT pretty, but that's what I came up with so far, that actually works:

wget --recursive -nH --cut-dirs=3 --content-disposition http://localhost:8080/api/datasets/NNNN/versions/1.0/fileaccess

Any feedback - comments/suggestions - are welcome.

mankoff commented 3 years ago

This looks good at first pass. I did not know of the --content-disposition flag for wget. What happens if that is left off? Are the files not named correctly? The rest of the wget command looks about as normal as most times that I use it...

One concern is the version part of the URL. Is there a way to easily always get the latest version? Either if version is left off the URL, or if it is set to latest rather than a number?

landreev commented 3 years ago

I did not know of the --content-disposition flag for wget. What happens if that is left off? Are the files not named correctly? The rest of the wget command looks about as normal as most times that I use it...

Correct, without the "--content-disposition" flag wget will download http://host/api/access/datafile/1234 and save it as 1234. With this flag wget will use the real filename that we supply in the "Content-Disposition:" header. (browsers do this automatically, so this header is the reason a browser offers to save a file downloaded from our dataset page under its user-friendly name). It is, unfortunately, impossible to use that header to supply a folder name as well. If you try something like Content-disposition: attachment; filename="folder/subfolder/testfile.txt" the "folder/subfolder" part is ignored, and the file is still saved as "testfile.txt". So I rely on both this header, and embedding the folder name into the access url, and --cut-dirs=3 to download /api/access/datafile/folder/subfolder/1234 and have it saved as folder/subfolder/testfile.txt.

landreev commented 3 years ago

One concern is the version part of the URL. Is there a way to easily always get the latest version? Either if version is left off the URL, or if it is set to latest rather than a number?

It understands our standard version id notations like :draft, :latest and :latest-published. But yes, I am indeed considering dropping the version from the path. So it would be /api/datasets/{datasetid}/fileaccess defaulting to the latest version available; with the optional ?version={version} query parameter for requesting a different version.

landreev commented 3 years ago

@mankoff Hi, a quick followup to the comments above: I ended up dropping the version parameter from the path. I also renamed the API. It is now called "dirindex" - to emphasize that it presents the dataset in a way that resembles the Apache Directory Index format.

So the API path is now

/api/datasets/{dataset}/dirindex

it defaults to the latest. An optional parameter?version={version} can be used to specify a different version. This is all documented in the API guide as part of the pull request linked above.

mankoff commented 3 years ago

Hello. If my institution upgrades their Dataverse, will we receive this feature? Or is it implemented into some future release and is not included in the latest version installed when updating?

djbrooke commented 3 years ago

Hi @mankoff, this will be included in the next release, 5.4. I added the 5.4 tag to the PR:

https://github.com/IQSS/dataverse/pull/7579

Once 5.4 shows up in https://github.com/IQSS/dataverse/releases you'll be able to install and use the release with this feature. We expect this in the next few weeks - we're just waiting on a few more issues to finish up.

mankoff commented 3 years ago

Hello. I see that demo.dataverse.org is now at v5.4, so I'd like to test this.

I'm reading the docs here https://guides.dataverse.org/en/latest/api/native-api.html?highlight=dirindex#view-dataset-files-and-folders-as-a-directory-index

And it seems to only work with the dataset ID. If I'm an end-user, how do I find the ID? Is there a way to browse the dirindex using the DOI? Can you provide an example with this demo data set? https://demo.dataverse.org/dataset.xhtml?persistentId=doi:10.70122/FK2/MV0TMN

mankoff commented 3 years ago

Also, regarding point #5 from https://github.com/IQSS/dataverse/issues/7084#issuecomment-731053846 this API does not allow browsing. When I go to https://demo.dataverse.org/api/datasets/24/dirindex I'm given a ".index" to download in firefox, not something that I can view in my browser. This also means (I think?) that browser-tools that I hoped would use this feature, like DownThemAll probably won't work.

poikilotherm commented 3 years ago

Also seeing a ".index.html" download with https://demo.dataverse.org/api/datasets/:persistentId/dirindex/?persistentId=doi:10.70122/FK2/HXJVJU or https://demo.dataverse.org/api/datasets/:persistentId/dirindex/?persistentId=doi:10.70122/FK2/PDRSIQ

The file contains the expected HTML page.

mankoff commented 3 years ago

Well that tells me how to use this with DOI and not ID. I suggest making this option clear in the API docs. I'll add an issue for that.

IQSS / dataverse

Implement access to the files in the dataset as a virtual folder tree #7084

Using Access API only

Using Native API for viewing + Access API for downloading