IQSS / dataverse.harvard.edu

Custom code for dataverse.harvard.edu and an issue tracker for the IQSS Dataverse team's operational work, for better tracking on https://github.com/orgs/IQSS/projects/34

Audit Files: Perform routine file system audit, comparing db records to file system. #37

Open · kcondon opened 8 years ago

kcondon commented 8 years ago

To better identify any potential data integrity issues, we should periodically audit the file system, comparing files there with files listed in the db. Not sure whether this should be a scheduled item, part of the app, or a script outside the app.

pdurbin commented 6 years ago

While discussing IQSS/dataverse#4564, IQSS/dataverse#4573, and IQSS/dataverse#4590 in standup this afternoon, @djbrooke mentioned that perhaps we should consider detection of data integrity issues, which is what this issue is about. As @kcondon suggests above, this could consist of iterating through all the files in the database and ensuring that they are all present on disk (or in S3 or Swift).
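
In concrete terms, such an audit could look something like the minimal sketch below. It assumes direct database access and a local file:// store; the `dvobject.storageidentifier` column, the `dtype = 'DataFile'` filter, and the `<files-dir>/<authority>/<dataset identifier>/<storage name>` path layout are assumptions about the schema and storage configuration, so the query and path construction would need to be adjusted for a real run (and for S3/Swift stores).

```python
# Sketch: list datafiles recorded in the DB and flag any missing from a local file store.
import os
import psycopg2  # the production DB is PostgreSQL

FILES_DIR = "/usr/local/dvn/data"        # hypothetical files directory
conn = psycopg2.connect("dbname=dvndb")  # hypothetical connection string
cur = conn.cursor()
cur.execute("""
    select df.id, df.storageidentifier, ds_obj.authority, ds_obj.identifier
    from dvobject df
    join dataset ds on ds.id = df.owner_id
    join dvobject ds_obj on ds_obj.id = ds.id
    where df.dtype = 'DataFile' and ds.harvestingclient_id is null
""")

missing = []
for datafile_id, storageidentifier, authority, dataset_identifier in cur:
    if not storageidentifier or not storageidentifier.startswith("file://"):
        continue  # skip S3/Swift-backed files in this sketch
    # Assumed layout: <FILES_DIR>/<authority>/<dataset identifier>/<storage name>
    path = os.path.join(FILES_DIR, authority, dataset_identifier,
                        storageidentifier.replace("file://", "", 1))
    if not os.path.isfile(path):
        missing.append((datafile_id, path))

for datafile_id, path in missing:
    print(f"datafile {datafile_id}: expected {path}, not found")
```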

djbrooke commented 4 years ago

Moving to dataverse.harvard.edu repo. We could do this more easily in prod after merging IQSS/dataverse#5867.

jggautier commented 2 years ago

Just noting in this issue that I found 72 files in the Harvard repository that went through the ingest process but whose original file format cannot be downloaded through the UI or API, such as the CSV version of the file at https://doi.org/10.7910/DVN/CZ3XO3/LQM3PB. Would a routine file system audit detect this, too?

I found these files while querying a copy of the Harvard repository's database for file and dataset sizes (for an unrelated project) and saw that in the "datatable" table, the originalfilesize of some of the files is -1 or 0. After some testing and digging, I found that when the originalfilesize is less than 1 and the originalfileformat is anything other than 'application/x-rlang-transport', trying to download the file's original format returns an error message.

In case it's helpful, here's the query I wound up with to find those files:

```sql
select distinct on (datatable.datafile_id)
    dvobject_dataset.identifier as dataset_identifier,
    datatable.datafile_id,
    case
        when dvobject_file.identifier is null then null
        else concat('https://dataverse.harvard.edu/file.xhtml?persistentId=doi:10.7910/', dvobject_file.identifier)
    end as file_doi_url,
    dvobject_file.publicationdate as file_publicationdate,
    filemetadata.label as file_label,
    datatable.originalfileformat,
    datatable.originalfilesize
from datatable
join dvobject dvobject_file on dvobject_file.id = datatable.datafile_id
join filemetadata on filemetadata.datafile_id = dvobject_file.id
join dataset on dataset.id = dvobject_file.owner_id
join dvobject dvobject_dataset on dvobject_dataset.id = dataset.id
where dataset.harvestingclient_id is null
    and datatable.originalfileformat is not null
    and datatable.originalfileformat != ''
    and datatable.originalfileformat != 'application/x-rlang-transport'
    and datatable.originalfilesize < 1;
```
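
To double-check the symptom from outside the database, the flagged datafile ids could be fed to the access API and the response status checked. A minimal sketch, assuming the ids from the query above have been exported to a hypothetical `flagged_datafile_ids.txt` (one id per line):

```python
# Sketch: for each flagged datafile id, try the original-format download and report failures.
import requests

with open("flagged_datafile_ids.txt") as f:  # hypothetical export of datatable.datafile_id values
    datafile_ids = [line.strip() for line in f if line.strip()]

for datafile_id in datafile_ids:
    url = f"https://dataverse.harvard.edu/api/access/datafile/{datafile_id}?format=original"
    # stream=True so we only read the status code, not the whole file body
    r = requests.get(url, stream=True)
    if r.status_code != 200:
        print(f"datafile {datafile_id}: HTTP {r.status_code} for {url}")
    r.close()
```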
charmoniumQ commented 1 year ago

As shown in IQSS/dataverse#9501, there can also be items whose stored file has an incorrect md5 hash or an incorrect size. An audit that compares the files' checksums and sizes against the values recorded in the DB could be useful.

I've been using something like this Python script, but the throughput is quite bad when going through the external API, and I don't want to hammer the server too hard.

```python
import requests, tqdm, hashlib, time

# Dataset DOIs taken from the atrisovic/dataverse-r-study repo, used as a sample to iterate over.
dois = requests.get("https://raw.githubusercontent.com/atrisovic/dataverse-r-study/master/get-dois/dataset_dois.txt").text.strip().split("\n")

for doi in tqdm.tqdm(dois[16:]):
    doi_obj = requests.get(f"http://dataverse.harvard.edu/api/datasets/:persistentId/versions/:latest?persistentId={doi}").json()
    for file_obj in doi_obj["data"]["files"]:
        # Only check small, unrestricted files; big files take too long.
        if not file_obj["restricted"] and file_obj["dataFile"]["filesize"] < 1024**2:  # 1 MiB
            dlurl = f"https://dataverse.harvard.edu/api/access/datafile/{file_obj['dataFile']['id']}"
            if "originalFileFormat" in file_obj["dataFile"]:
                dlurl += "?format=original"
            # Compare the MD5 of the downloaded bytes against the checksum recorded in the file metadata.
            actual_hash = hashlib.md5(requests.get(dlurl).content).hexdigest()
            if actual_hash != file_obj["dataFile"]["md5"]:
                print(doi, file_obj["label"], file_obj["dataFile"]["md5"], actual_hash, f"curl -Lq '{dlurl}' | md5sum")
                break
            time.sleep(1)
```
landreev commented 1 year ago

> and I don't want to hammer the server too hard.

We appreciate that. :)

landreev commented 1 year ago

I wasn't actively using this issue; I actually thought it had been closed back in the main repo after some work on auditing was done in 2019 (the work Danny mentions above). But sure, we can use it going forward for discrepancy reports from users. I will look into the missing originals mentioned above; there are some tabular files that are missing their originals for various historical reasons, but I'll need to take a closer look to see whether these are old or new cases.