kcondon opened 8 years ago
While discussing IQSS/dataverse#4564 IQSS/dataverse#4573 and IQSS/dataverse#4590 this afternoon in standup @djbrooke mentioned that perhaps we should consider detection of data integrity issues, which is what this issue is about. Like @kcondon suggests above, this could consist of iterating through all the files in the database and ensuring that they are all present on disk (or S3 or Swift).
Moving to dataverse.harvard.edu repo. We could do this more easily in prod after merging IQSS/dataverse#5867.
Just noting in this issue that I found 72 files in the Harvard repository that went through the ingest process but whose original file format cannot be downloaded through the UI or API, such as the CSV version of the file at https://doi.org/10.7910/DVN/CZ3XO3/LQM3PB. Would a routine file system audit detect this, too?
I found these files while querying a copy of the Harvard repository's database for file and dataset sizes (for an unrelated project) and saw that in the "datatable" table, the originalfilesize of some of the files is -1 or 0. After some testing and digging, I found that when the originalfilesize is less than 1 and the originalfileformat is anything other than 'application/x-rlang-transport', trying to download the file's original format returns an error message.
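That condition can be expressed as a small filter. The sketch below assumes rows from the `datatable` table are available as dicts with the column names mentioned above; the helper name and the sample rows are mine, not Dataverse code:

```python
def flag_broken_originals(datatable_rows):
    """Return rows whose original format is likely undownloadable:
    originalfilesize is -1 or 0 (or missing) while the original format
    is anything other than 'application/x-rlang-transport'."""
    return [
        row for row in datatable_rows
        if (row.get("originalfilesize") or 0) < 1
        and row.get("originalfileformat") != "application/x-rlang-transport"
    ]

# Hypothetical rows mirroring the cases described above:
rows = [
    {"id": 1, "originalfilesize": -1, "originalfileformat": "text/csv"},
    {"id": 2, "originalfilesize": 0, "originalfileformat": "application/x-rlang-transport"},
    {"id": 3, "originalfilesize": 2048, "originalfileformat": "text/csv"},
]
print([r["id"] for r in flag_broken_originals(rows)])  # [1]
```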
As shown in IQSS/dataverse#9501, there can also be items with an incorrect md5 hash or an incorrect size. An audit that compares these contents to the DB could be useful.
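The per-file check in such an audit could recompute the stored file's checksum and size and compare them with the values recorded in the database. A minimal sketch, assuming local disk storage; the function name and argument shapes are my own, not Dataverse code:

```python
import hashlib
import os

def verify_file(path, expected_md5, expected_size):
    """Compare a file on disk against the checksum and size recorded in the DB.
    Returns a list of human-readable discrepancies (empty list = file is OK)."""
    if not os.path.exists(path):
        return [f"{path}: missing from storage"]
    problems = []
    actual_size = os.path.getsize(path)
    if actual_size != expected_size:
        problems.append(f"{path}: size {actual_size} != recorded {expected_size}")
    md5 = hashlib.md5()
    with open(path, "rb") as f:
        # Stream in 1 MiB chunks so large files don't need to fit in memory.
        for chunk in iter(lambda: f.read(1 << 20), b""):
            md5.update(chunk)
    if md5.hexdigest() != expected_md5:
        problems.append(f"{path}: md5 {md5.hexdigest()} != recorded {expected_md5}")
    return problems
```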
I've been using something like this Python script, but the throughput is quite bad when going through the external API, and I don't want to hammer the server too hard.
```python
import requests, tqdm, hashlib, time

dois = requests.get("https://raw.githubusercontent.com/atrisovic/dataverse-r-study/master/get-dois/dataset_dois.txt").text.strip().split("\n")
for doi in tqdm.tqdm(dois[16:]):
    doi_obj = requests.get(f"http://dataverse.harvard.edu/api/datasets/:persistentId/versions/:latest?persistentId={doi}").json()
    for file_obj in doi_obj["data"]["files"]:
        if not file_obj["restricted"] and file_obj["dataFile"]["filesize"] < 1024**2:  # 1 MiB
            # Only check small files; big files take too long.
            dlurl = f"https://dataverse.harvard.edu/api/access/datafile/{file_obj['dataFile']['id']}"
            if "originalFileFormat" in file_obj["dataFile"]:
                dlurl += "?format=original"
            actual_hash = hashlib.md5(requests.get(dlurl).content).hexdigest()
            if actual_hash != file_obj["dataFile"]["md5"]:
                print(doi, file_obj["label"], file_obj["dataFile"]["md5"], actual_hash, f"curl -Lq '{dlurl}' | md5sum")
                break
    time.sleep(1)
```
> and I don't want to hammer the server too hard.
We appreciate that. :)
I wasn't actively using this issue; I actually thought it was closed back in the main repo after some work on auditing was done in 2019 (the work Danny mentions above). But sure, we can use it going forward for discrepancy reports from users. I will look into the missing originals mentioned above. (There are some tabular files that are missing their originals for various historical reasons, but I'll need to take a closer look to see whether these are old or new cases.)
To better identify potential data integrity issues, we should periodically audit the file system, comparing the files there with the files listed in the database. I'm not sure whether this should be a scheduled item, part of the app, or a script outside the app.
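Whichever form it takes, the core comparison is a set difference between the storage identifiers recorded in the database and the files actually present in storage. A sketch under that assumption; `audit` and the sample identifiers are hypothetical, and in practice the two input lists would come from a DB query and a listing of the storage backend (disk, S3, or Swift):

```python
def audit(db_storage_ids, storage_files):
    """Compare file identifiers recorded in the database with files found
    in storage. Returns (missing, orphaned):
      missing  = recorded in the DB but absent from storage
      orphaned = present in storage but unknown to the DB
    """
    db_set, fs_set = set(db_storage_ids), set(storage_files)
    return sorted(db_set - fs_set), sorted(fs_set - db_set)

# Hypothetical identifiers for illustration:
missing, orphaned = audit(
    ["10.7910/DVN/ABC/file1", "10.7910/DVN/ABC/file2"],
    ["10.7910/DVN/ABC/file2", "10.7910/DVN/ABC/stray"],
)
print(missing)   # ['10.7910/DVN/ABC/file1']
print(orphaned)  # ['10.7910/DVN/ABC/stray']
```

Either list being non-empty is a discrepancy worth reporting: missing files are the data-loss case discussed above, while orphaned files may indicate incomplete deletes.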