TheStanfordDaily / archives-scripts

Scripts for processing data on the Stanford Daily Archives

Content that is split across different pages #8

Open hesyifei opened 5 years ago

hesyifei commented 5 years ago

For example, this article is split across two pages: https://stanforddailyarchive.com/cgi-bin/stanford?a=d&d=stanford20140106-01.2.5&e=-------en-20--1--txt-txIN-------#


However, https://github.com/TheStanfordDaily/archives-text/blob/3e24b7ee6c55dac8fcff552e02119b502afd6f42/2014/01/06/MODSMD_ARTICLE4.article.txt contains only the part that is on the first page.

https://github.com/TheStanfordDaily/archives-text/blob/3e24b7ee6c55dac8fcff552e02119b502afd6f42/2014/01/06/MODSMD_ARTICLE4.article.txt#L1-L53

epicfaace commented 5 years ago

Here's the relevant ALTO file: https://tiles.archives.stanforddaily.com/data.2014-oct/data/stanford/2014/01/06_01/Stanford_Daily-ALTO/Stanford_Daily_20140106_0001_ALTO0002.xml
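
To check whether the missing part of the article shows up in that page's text blocks, something like the following could work. This is only a rough sketch: it assumes the file follows the usual ALTO layout (TextBlock > TextLine > String elements with a CONTENT attribute), and the function name `alto_text_blocks` and the sample XML are made up for illustration; they are not taken from the scripts in this repository.

```python
import xml.etree.ElementTree as ET


def local(tag):
    """Strip the XML namespace so the code works across ALTO versions."""
    return tag.rsplit("}", 1)[-1]


def alto_text_blocks(xml_text):
    """Return {TextBlock ID: text} for every TextBlock in an ALTO document.

    Hypothetical helper. Words on a line are joined with single spaces and
    lines with newlines; hyphenation (HYP elements) is ignored.
    """
    root = ET.fromstring(xml_text)
    blocks = {}
    for block in root.iter():
        if local(block.tag) != "TextBlock":
            continue
        lines = []
        for line in block.iter():
            if local(line.tag) != "TextLine":
                continue
            words = [s.get("CONTENT", "") for s in line if local(s.tag) == "String"]
            lines.append(" ".join(words))
        blocks[block.get("ID")] = "\n".join(lines)
    return blocks


# Made-up sample input, not the contents of the linked file.
SAMPLE = """<alto xmlns="http://www.loc.gov/standards/alto/ns-v2#">
  <Layout><Page ID="P2"><PrintSpace>
    <TextBlock ID="TB1">
      <TextLine><String CONTENT="Continued"/><SP/><String CONTENT="from"/></TextLine>
      <TextLine><String CONTENT="page"/><SP/><String CONTENT="1"/></TextLine>
    </TextBlock>
  </PrintSpace></Page></Layout>
</alto>"""

for block_id, text in alto_text_blocks(SAMPLE).items():
    print(block_id, repr(text))
```

Running this over the ALTO files for both pages and comparing the block IDs against the ones referenced by the article would show whether the second-page blocks exist in the source data but are dropped when the article text is assembled.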