desmarais-lab / govWebsites


city websites for top 100 cities in the US #5

Closed markusneumann closed 6 years ago

markusneumann commented 6 years ago

This data consists of 1,467,938 files (txt, html, pdf, doc, and docx only), totaling 857.2 GB.

I am not sure what to do with this data right now, because:

(a) I always *need* to make backups before I do anything to the files (and there are good reasons for that), but in this case I can't zip them, since at least one website has file paths so long that the compression program can't handle them, and another ~800 GB backup seems really excessive.

(b) Any analysis on this will take ages, and in some cases probably won't be possible at all.

The main culprits are San Diego, New York, and Chicago, which are around 30-60 GB each.

bdesmarais commented 6 years ago

Is it possible to process files to plain text as they stream in? Or just download 10GB at a time, process them to plain text, then delete the native files? We could hold onto a random sample of the original files in order to test the performance of the text conversion.
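
A minimal sketch of how that batch workflow might look in R (R is assumed here because the conversion already relies on the readtext package); the directory names, the 10 GB batch size, and the 1% audit sample are placeholders, not decisions that were actually made:

```r
library(readtext)

raw_files <- list.files("raw", recursive = TRUE, full.names = TRUE)
batch_id  <- cumsum(file.size(raw_files)) %/% (10 * 1024^3)        # ~10 GB per batch
audit     <- sample(raw_files, ceiling(0.01 * length(raw_files)))  # random 1% kept as originals

dir.create("txt", showWarnings = FALSE)
dir.create("audit_sample", showWarnings = FALSE)

for (b in unique(batch_id)) {
  batch <- raw_files[batch_id == b]
  txt   <- readtext(batch)                                  # data frame with doc_id and text
  saveRDS(txt, sprintf("txt/batch_%03d.rds", as.integer(b)))
  file.copy(intersect(batch, audit), "audit_sample/")       # hold back the audit sample
  file.remove(setdiff(batch, audit))                        # delete the rest of the native files
}
```

The audit sample could then be used to spot-check the quality of the text conversion after the remaining originals are gone.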

markusneumann commented 6 years ago

The reason I need a backup is that the conversion to text is not completely trivial. The step before the actual conversion is to check every file for whether it has the correct file extension and, if necessary, change it. Unfortunately this is not exactly foolproof, and so far, with each new set of texts, I've had to do this more than once because some new kind of bizarre and unexpected error cropped up.

For example, when converting the websites of the 100 big cities' mayors, I had one case where the script changed a file with completely nonsensical content to a pdf, because the first line contained a long string of random characters, and at one point that string happened to contain the letters 'P', 'D', 'F' in sequence. This then caused the readtext package (with which we convert the files to .txt) to throw an error.

I've now made the code for that particular issue more stringent, but with ~1.4 million new files, it's fairly likely that some new problem I can't foresee will occur. Hence, I would rather have a backup.
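
For what it's worth, a hedged sketch of what a stricter check along these lines could look like, matching file signatures (magic numbers) instead of searching the first line for the letters 'PDF'; the function names and paths are illustrative, not the actual code in this repository:

```r
# Treat a file as a PDF only if it literally begins with the "%PDF-" magic number,
# rather than merely containing the letters P, D, F somewhere in its first line.
is_really_pdf <- function(path) {
  identical(readBin(path, what = "raw", n = 5L), charToRaw("%PDF-"))
}

# docx (and other Office Open XML) files are zip archives, which begin with "PK".
is_really_docx <- function(path) {
  identical(readBin(path, what = "raw", n = 2L), charToRaw("PK"))
}

# Illustrative use: only rename a file when the magic number actually matches.
# f <- "raw/some_city/page123"   # placeholder path
# if (is_really_pdf(f)) file.rename(f, paste0(f, ".pdf"))
```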

bdesmarais commented 6 years ago

Can you write a test to determine whether the parser worked correctly? If so, we could keep a backup of the files that failed the test.
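
As a rough sketch of what such a test might look like: flag converted documents whose text is suspiciously short or mostly non-alphabetic, and copy only those originals to a backup. This assumes the converted output is a readtext data frame `txt` with doc_id and text columns, and that doc_id can be mapped back to the raw file paths; the thresholds are arbitrary placeholders:

```r
# Heuristic check: enough text overall, and a reasonable share of alphabetic characters.
looks_ok <- function(text, min_chars = 50, min_alpha = 0.5) {
  n_alpha <- nchar(gsub("[^[:alpha:]]", "", text))
  nchar(text) >= min_chars && (n_alpha / max(nchar(text), 1)) >= min_alpha
}

failed <- txt$doc_id[!vapply(txt$text, looks_ok, logical(1))]

dir.create("failed_originals", showWarnings = FALSE)
file.copy(file.path("raw", failed), "failed_originals/")   # back up only the failures
```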

markusneumann commented 6 years ago

The compression problem seems to have been solved by using pigz, as recommended by Frido.
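
For reference, a minimal sketch of the kind of invocation this might involve, called from R to keep the language consistent; the archive and directory names are placeholders. GNU tar's -I flag hands the compression off to pigz, which runs in parallel across cores:

```r
# Parallel compression of the raw website directory via pigz (placeholder paths).
system2("tar", c("-I", "pigz", "-cf", "city_websites_backup.tar.gz", "city_websites/"))
# Roughly equivalent shell command: tar -I pigz -cf city_websites_backup.tar.gz city_websites/
```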