YaleDHLab / ensemble-at-yale

Crowdsourcing the transcription of Yale playbills - http://bit.ly/ensemble-at-yale
http://ensemble.yale.edu
MIT License

Final PDF image load #126

Closed pleonard212 closed 7 years ago

pleonard212 commented 7 years ago

Some of these might already be accomplished, but just to lay them out:

1) Upload PDF pages to s3:

https://yale.box.com/s/hb24wollhna2c6oq5mbw7fxv34t3hr5s

2) Upload PDF thumbnails to s3:

https://yale.box.com/s/0uv5a78ue2naei8hmy6n45qyncdwlpo4

3) Ingest the new PDF programs:

https://github.com/YaleDHLab/ensemble-at-yale/blob/master/project/ensemble-at-yale/subjects/group_james_bundy_pdf.csv

(Hopefully I formatted that CSV file for easy ingestion.) I've given it a different name from the scanned Bundy material, in case that makes it easier to ingest cleanly. But all the items in it share the james_bundy group key.

Adding @lindsaymking to this in case she can answer any questions that come up...

duhaime commented 7 years ago

@pleonard212 @lindsaymking I just uploaded the images and began the rake task to reprocess the subjects. I'll add another quick note once that process completes...

duhaime commented 7 years ago

It looks like the rake task that ingests the subjects that belong to a group requires those files to have a particular naming convention, so we should probably create just a single file for the Bundy era.

@pleonard212 @lindsaymking, just to check, are the plays in the Bundy pdf file an addition to or a replacement for the files in the Bundy csv file?

pleonard212 commented 7 years ago

No prob, I will bundle up the old and new files into one master csv. The Bundy PDFs are in addition to the previous, non-PDF Bundy programs.

pleonard212 commented 7 years ago

This commit should now have one master Bundy list with both the older programs and the new PDF programs in it:

https://github.com/YaleDHLab/ensemble-at-yale/commit/f2ec3fa4cd019bf976d4ed8d4e0ffe0dec215e0c

duhaime commented 7 years ago

Awesome, thank you. The new rake task is running now, so we should be all set in an hour or so...

duhaime commented 7 years ago

The rake task just finished. How does this look? http://ensemble.yale.edu/#/groups/58bdafbd8b14044c27fdc318

pleonard212 commented 7 years ago

I think all we have to do now is to adjust the maximum extent of the slider on the Bundy era to extend beyond 2008... looks like the csv itself accurately reflects the new 2016 date, so maybe just a manual touch-up of the max date field in mongo for that group?

duhaime commented 7 years ago

The range of values on the page browser is set by the range of years in the playbills for the given group, which now run through 2016 for Bundy (after the ingest last night).
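As a rough illustration of that behavior (a sketch only, not the project's actual code), the slider bounds can be derived from the years of the playbills in a group. The `year` field name and the sample records are assumptions made for the example; only the 2008 and 2016 endpoints come from this thread.

```python
def year_range(playbills):
    """Return (min_year, max_year) across a group's playbills, or None if empty."""
    # Skip records with a missing or blank year rather than failing on them.
    years = [int(p["year"]) for p in playbills if p.get("year") not in (None, "")]
    if not years:
        return None
    return min(years), max(years)


# Hypothetical records standing in for the playbills of one group.
bundy = [{"year": "2008"}, {"year": "2016"}, {"year": "2012"}]
print(year_range(bundy))  # (2008, 2016)
```

Because the bounds are computed from the data, re-ingesting the group is enough to move the slider's maximum; no separate max date field needs editing under this assumption.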

That said, we now have another problem: the page numbers of programs appear to be buggy. If you look at the Bundy era group browser, for instance, you'll see lots of duplicates. It looks like the process that produced those page numbers somehow processed the page number 10 as the page number 1. Here's a shot from the Stan W. era csv that seems to exhibit the problem:

[Screenshot: rows from the Stan W. era csv, 2017-03-07 10:28:31 AM]

It looks like the second instance of the page_no indicated as 1 should be page 10. Because it's marked as 1, though, that page gets included in the SubjectSetFirstPage for that SubjectSet, which is the model used to populate the view. Because there are two pages with page_no 1, we get the dupes in the view.

Previously we didn't see duplicates because we used different logic to identify whether a page was the first page of its SubjectSet: we looked for -p0001 in the page image's filename, presumably because the page_no had been found to be buggy at times. Because the new Bundy era data doesn't use the same image naming convention, we switched last night to checking the page_no value of a subject to see if that subject is the first subject of its subject set.
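The two checks described above can be sketched roughly as follows. This is an illustration in Python rather than the rake task itself; the `file_path` column name and the sample rows are hypothetical, while `page_no` and the `-p0001` marker come from this thread.

```python
def is_first_page_by_filename(subject):
    # Older check: rely on the "-p0001" marker in the image file name.
    return "-p0001" in subject["file_path"]


def is_first_page_by_page_no(subject):
    # Newer check: trust the page_no column.
    return int(subject["page_no"]) == 1


# Hypothetical rows for one subject set; page 10 has lost its trailing zero.
subjects = [
    {"file_path": "example-p0001.jpg", "page_no": "1"},
    {"file_path": "example-p0002.jpg", "page_no": "2"},
    {"file_path": "example-p0010.jpg", "page_no": "1"},
]

by_name = [s for s in subjects if is_first_page_by_filename(s)]
by_page_no = [s for s in subjects if is_first_page_by_page_no(s)]
print(len(by_name), len(by_page_no))  # 1 2
```

The filename check finds a single first page, while the page_no check finds two, which is how the duplicates end up in the view.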

If all this is right, we have two options: fix the data or work around the data in the rake task. It's probably best to fix the data itself. Is it possible to do so? If not, I can update the rake task to work with the current data.

pleonard212 commented 7 years ago

OK! I think I have solved this.

1) Within the PDF subset of group_bundy, all programs that began with page_no values of 0 have been incremented by 1.

So no need to interrogate for the lowest page_no value; it will always be 1.

2) The page_no values in all CSVs that had mistakenly lost their trailing 0 (10, 20, 30, etc.) have been restored.
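For reference, the two corrections above could be expressed along these lines. This is only a sketch under stated assumptions, not the procedure actually used on the CSVs: it assumes each program's page_no values are available in page order, and the function names are hypothetical.

```python
def shift_zero_based(page_nos):
    """If a program's numbering starts at 0, shift every page up by 1."""
    pages = [int(p) for p in page_nos]
    if pages and min(pages) == 0:
        return [p + 1 for p in pages]
    return pages


def restore_trailing_zeros(page_nos):
    """Restore multiples of ten that lost their trailing zero.

    A value is repaired only when it breaks the sequence and multiplying
    it by 10 gives exactly the page number expected next.
    """
    fixed = []
    for p in (int(p) for p in page_nos):
        if fixed and p != fixed[-1] + 1 and p * 10 == fixed[-1] + 1:
            p *= 10
        fixed.append(p)
    return fixed


# Hypothetical sequences for two programs.
print(shift_zero_based([0, 1, 2]))                           # [1, 2, 3]
print(restore_trailing_zeros(list(range(1, 10)) + [1, 11]))  # [1, 2, ..., 10, 11]
```

After both corrections, each program should contain exactly one row with page_no 1, which is what the first-page check relies on.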

duhaime commented 7 years ago

Many thanks! The rake task just completed, so if the results look good to you, we should be all set to close this one!

pleonard212 commented 7 years ago

Fantastic!