IQSS / dataverse

Open source research data repository software
http://dataverse.org

API: As a data owner, I want to "uningest" a tabular data file so that files that should not have been ingested are saved appropriately #3766

Closed landreev closed 6 years ago

landreev commented 7 years ago

We occasionally get requests from users to revert their ingested tabular files to their original state. (example: RT 247789 - https://help.hmdc.harvard.edu/Ticket/Display.html?id=247789) Some data were never meant to be tabular. This is particularly common with Excel spreadsheets. For example, authors may use a spreadsheet for listing their bibliographical references; having something like that automatically converted to tabular format, and inviting users to "explore" it with TwoRavens is not really what they want.

(There is an open ticket #2199, to allow users to skip tabular ingest on a new file; this issue deals with a file that's already ingested).

The process is fairly straightforward. These uningest requests are currently handled by running a command line script. Several things need to happen:

- delete the datatable object and its child objects (datavariables, summarystatistics, etc.);
- recalculate the version UNF;
- replace the file with the saved original;
- remove any derivative files;
- restore the original size, mime type and file name.

All these steps simply need to be reimplemented inside the app.

Some thought will need to go into figuring out how to add this option to the UI. (Is it an extra button shown to users with the edit permission on the dataset? Should we use the existing checkboxes, with an extra option in the pulldown menu under "edit files"?)

djbrooke commented 6 years ago

For now, switching this up to cover a new administrative/curation API endpoint instead of a user-facing feature.
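For illustration only, here is a sketch of how such an admin/curation endpoint might be invoked once implemented. The route, the token header, and the values below are all assumptions at this point, not the final API:

```shell
# Hypothetical invocation of an admin "uningest" endpoint.
# API_TOKEN, SERVER_URL, FILE_ID and the route are illustrative
# assumptions, not the implemented API.
API_TOKEN="xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx"   # superuser key
SERVER_URL="http://localhost:8080"
FILE_ID=42

# Printed rather than executed, since no running server is assumed here:
echo curl -H "X-Dataverse-key:$API_TOKEN" -X POST \
    "$SERVER_URL/api/files/$FILE_ID/uningest"
```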

djbrooke commented 6 years ago

@landreev - in backlog grooming this week, you mentioned you'd share a script in this issue.

djbrooke commented 6 years ago

@landreev - we're taking this on this sprint, can you drop in that script? Thanks!

landreev commented 6 years ago

Missed the message above - sorry. Here's the script we are currently using in production to uningest a tabular file; creating an API call to do this would involve coding the same 9 or 10 steps from the script (which in turn simply reverses what happens during a tabular ingest).

The script takes the (database) id of the datafile. Note that it calls 2 other scripts that live on the prod. system (in /usr/local/dataverse-admin/bin):

- runsql - runs an arbitrary sql query in postgres; if the query produces output, it prints it tab-delimited;
- lookup_file_by_id.sh - does what the name suggests; for our purposes, it translates the database id of the datafile into the S3 storage identifier - for example "s3://dataverse-bucket/10.7910/DVN/PQ7IUE/15ea0342996-41b3ee674df0" - that the AWS command line utility understands.
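For reference, the runsql helper described above could be sketched roughly like this (an assumption based on its described behavior; the real script lives in /usr/local/dataverse-admin/bin and its connection details are unknown):

```shell
# Hypothetical sketch of the runsql helper: runs a query via psql and
# prints any output as bare, tab-delimited rows. The -U/-d values are
# illustrative assumptions.
runsql() {
    if [ "$1" = "-q" ]; then shift; fi   # -q: quiet mode, as used below
    TAB=$(printf '\t')
    psql -A -t -F "$TAB" -U dvnapp -d dvndb -c "$1"
}
```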

the script:

#!/bin/sh

# takes the datafile db id as the argument
# (see 247789-uningest-tabular/README.txt)

fileid=$1

if [ -z "$fileid" ]
then
    echo "usage: ./uningest.sh <DATAFILE ID>"
    exit 1
fi

PATH=/usr/local/dataverse-admin/bin:$PATH; export PATH

filename=`echo $fileid | lookup_file_by_id.sh | grep '^FILE NAME' | awk '{print $4}'`

datatableid=`runsql -q 'SELECT id FROM datatable WHERE datafile_id='$fileid`

originaltype=`runsql -q 'SELECT originalfileformat FROM datatable where datafile_id = '$fileid`

echo "ORIGINAL TYPE: " $originaltype

# delete the tabular data associated with the file:

runsql -q "DELETE FROM summarystatistic WHERE datavariable_id in (SELECT id FROM datavariable WHERE datatable_id=$datatableid)"

runsql -q "DELETE FROM datavariable WHERE datatable_id=$datatableid"

runsql -q "DELETE FROM datatable WHERE id=$datatableid"

# and reset the ingest stats: 

runsql -q 'DELETE FROM ingestreport WHERE datafile_id = '$fileid
runsql -q "UPDATE datafile SET ingeststatus='A' where id=$fileid"

# reset the size (from the size of the generated tabular file, back to the size of the original)

originalsize=`aws s3 ls $filename.orig | awk '{print $3}'`
echo "ORIGINAL SIZE: " $originalsize

runsql -q "UPDATE datafile SET filesize=$originalsize WHERE id=$fileid"

# replace the generated tabular file with the preserved original: 

echo aws s3 mv $filename.orig $filename
aws s3 mv $filename.orig $filename

# fix the mime type: 

runsql -q "UPDATE datafile SET contenttype='$originaltype' WHERE id=$fileid"

case $originaltype in
application/vnd.openxmlformats-officedocument.spreadsheetml.sheet)
    originalextension="xlsx"
    ;;
application/stata*|application/x-stata*)
    originalextension="dta"
    ;;
application/x-spss-sav)
    originalextension="sav"
    ;;
application/x-spss-por)
    originalextension="por"
    ;;
application/x-rlang-transport)
    originalextension="RData"
    ;;
text/csv)
    originalextension="csv"
    ;;
*)
    originalextension="unknown"
    ;;
esac

echo "restoring the ORIGINAL EXTENSION: "$originalextension

# Change the filename extension from ".tab" back to what it was before the file was ingested:

runsql -q "SELECT id,label FROM filemetadata WHERE datafile_id=$fileid" | sed 's/\.tab$//' | while read fmid label
do
    echo "filemetadata id: "$fmid", changing label to "$label"."$originalextension
    runsql -q "UPDATE filemetadata SET label='${label}.${originalextension}' WHERE id=$fmid"
done

# finally, fix/recalculate the UNFs of the versions that have this file:

runsql -q "SELECT datasetversion_id FROM filemetadata WHERE datafile_id=$fileid" | while read versionid
do
    runsql -q 'UPDATE datasetversion SET unf=null WHERE id='$versionid
    curl -X POST http://localhost:8080/api/admin/datasets/integrity/$versionid/fixmissingunf?forceRecalculate=true
done
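Not part of the script itself, but for whoever reimplements this: a few sanity-check queries (illustrative only; the datafile id below is hypothetical, and the runsql helper is assumed) that can confirm the uningest worked. Printed here rather than executed, since no database is assumed:

```shell
# Suggested post-uningest sanity checks, shown as the runsql commands
# one would run. fileid is a hypothetical datafile db id.
fileid=12345
for query in \
    "SELECT COUNT(*) FROM datatable WHERE datafile_id=$fileid" \
    "SELECT contenttype, filesize FROM datafile WHERE id=$fileid" \
    "SELECT label FROM filemetadata WHERE datafile_id=$fileid"
do
    # the datatable row should be gone, the contenttype/filesize should
    # match the original, and no label should still end in .tab
    echo runsql -q "$query"
done
```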
landreev commented 6 years ago

To clarify: whoever ends up working on this, you don't need to literally reimplement this script in Java 1:1! It's provided for reference, to list everything that needs to be done. It is an admin script written specifically for our prod. system. It assumes that the files live on S3 - but you should not make any such assumption; simply use the StorageIO system to replace the tabular file with the stored original. You also don't need to write code that generates the file extension based on the stored original mime type - there is already a method in FileUtil that does that. Etc.

landreev commented 6 years ago

I was finally able to test this with Swift on Fri. It's not working, so I'm moving it back to dev. to work on fixing the Swift driver, per our agreement. There were multiple other issues with the PR.

I'm happy to fix all this too, while I'm working on the swift driver part. But we can talk about it during standup.

landreev commented 6 years ago

Fixed the issues in the last comment above. Moving back into code review (?)

kcondon commented 6 years ago

Issues/questions so far:

landreev commented 6 years ago

@sekmiller Are you sure it was necessary to do that .merge() on the filemetadata in order to get the new extension to stick? It was working for me consistently without that merge. (We do a merge on the DatasetVersion that contains that FileMetadata two lines down...) Well, I don't think it should hurt either... Looks ok otherwise.

sekmiller commented 6 years ago

It wasn't working at all for me without it.