acl-org / acl-anthology

Data and software for building the ACL Anthology.
https://aclanthology.org
Apache License 2.0

Scanning of IWPT proceedings 1991-2000 #355

Closed kilian-gebhardt closed 4 years ago

kilian-gebhardt commented 5 years ago

We need to find a suitable scanning company to do the scanning of the IWPT proceedings. The physical copies are located at different places in continental Europe.

Questions relevant for selection:

Page numbers:

kilian-gebhardt commented 5 years ago

Some European companies that list pricing information on their websites:

EDIT: all the above offers include OCR. I'm not sure if they can handle multiple languages in one book – I think there could be at least some non-English example sentences in the papers.

mjpost commented 5 years ago

Thanks for this! These prices are all more or less fine (maybe scan2go is a bit pricey, but it is non-destructive).

I have to imagine most of these were printed in grayscale, so perhaps that would suffice. Perhaps we could check?

I wonder if there are any recommended scanning settings anywhere? For example, what did the Google Books project do? I have heard that Steven Bird did much of the original scanning, perhaps we could ask him (I imagine he'd be responsive, but if not, I think @davidweichiang knows him fairly well).

Another question to ask: I wonder if any of these companies would also add an OCR text layer to the PDFs, which lets text highlighting and so on work. It'd be great if they could also be enlisted to do the metadata (title, authors).

kilian-gebhardt commented 5 years ago

> I have to imagine most of these were printed in grayscale, so perhaps that would suffice. Perhaps we could check?

We should do this. Can you forward the e-mail contacts to me, so I can also ask for other information, such as exact page numbers and whether they would agree to destructive scanning?

> I wonder if there are any recommended scanning settings anywhere? For example, what did the Google Books project do? I have heard that Steven Bird did much of the original scanning, perhaps we could ask him (I imagine he'd be responsive, but if not, I think @davidweichiang knows him fairly well).

Google Books seems to use between 300 dpi and 600 dpi. I asked the digitization team of our local library. They also suggest at least 300 dpi and say that 300 dpi will be fine as long as there are no fancy graphics or plots.

> These prices are all more or less fine (maybe scan2go is a bit pricey, but it is non-destructive).

(overnight-scanning is non-destructive too and cheaper.) I contacted another company that was recommended to me by my local library and asked for an offer. Once I have a quote I will post it here as well.

> Another question to ask: I wonder if any of these companies would also add an OCR text layer to the PDFs, which lets text highlighting and so on work. It'd be great if they could also be enlisted to do the metadata (title, authors).

Edited above: at least basic OCR is part of the package. For metadata creation I can contact the companies; however, we could also scrape it from the tables of contents here.

mjpost commented 5 years ago

Done on the emails.

300 dpi sounds good.

We could of course enter the metadata ourselves, but it might also be nice to see if we could outsource that. If it's not too expensive on top of scanning, we could save ourselves a little bit of effort.

kilian-gebhardt commented 5 years ago

I updated the issue with the information provided by Alberto and Harry. Both agree to destructive scanning. I will contact overnight-scanning to ask if they can produce metadata. The easiest thing to ask for is probably BibTeX entries with:

mjpost commented 5 years ago

Awesome!

It may be easier for them to give us a spreadsheet (just thinking of likely tech skills).

Also please be sure to ask about OCR. I know this can be done with Acrobat Pro (I think that’s what @knmnyn used before) but it would be nice not to do it ourselves.
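
If we did end up doing the OCR ourselves, I believe the open-source ocrmypdf package could be an alternative to Acrobat Pro. Roughly (an untested sketch; the file names and language choice are just assumptions):

```python
# Sketch: add a searchable OCR text layer to a scanned PDF with ocrmypdf
# (which runs Tesseract under the hood). File names are placeholders.
import ocrmypdf

ocrmypdf.ocr(
    "iwpt_scan.pdf",         # scanned input
    "iwpt_scan_ocr.pdf",     # output with a hidden text layer
    language="eng+deu",      # assumption: English plus some non-English examples
    deskew=True,
)
```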

kilian-gebhardt commented 5 years ago

overnight-scanning responded:

> Our company can definitely do the scanning for your request. Once we have the scanning done, we can quote you on the rest of the job. This is because of the different layouts, types of writing, and other elements involved in data capture.

mjpost commented 5 years ago

Sounds good to me!

kilian-gebhardt commented 5 years ago

To catch up on this:

@mjpost This is where I need you to place the order – or we need to agree on some reimbursement plan.

mjpost commented 5 years ago

Pinging @desilinguist who should be able to help with the payment. Can we use the credit card to pay for the parcel shipping too, somehow? (That'd be easier than reimbursement)

kilian-gebhardt commented 5 years ago

I just clicked through the UPS ordering process. You can schedule pick-up dates and pay with credit card in advance. We need to know the weights and dimensions of the parcels, though, to calculate the exact rate. I will ask Harry and Alberto.

desilinguist commented 5 years ago

Perfect! Let me know what the calculation yields.

mjpost commented 5 years ago

Hi @kilian-gebhardt, can you update us on the status here? Has the scanning occurred yet?

kilian-gebhardt commented 5 years ago

@mjpost Unfortunately not yet. Alberto wrote me that he cannot prepare shipping before mid-September, and the schedule with Harry is not clear – I will contact him again.

mjpost commented 5 years ago

Okay, no problem. I expected this would take time. Thanks for the update!

kilian-gebhardt commented 5 years ago

Hey @desilinguist and @mjpost,

Harry has his books packed (for some weeks now; I still have no news from Alberto – I suggest we proceed with Harry's books nevertheless). The shipping with UPS is between 30 and 40 Euro. The scanning costs 134.09 Euro (for all books, probably 110 Euro for just Harry's books). How shall we organize the ordering and payment process? Some information needs to be entered in forms (page numbers of the books, scanning quality, etc.; parcel dimensions and addresses for shipping).

Best regards, Kilian

mjpost commented 5 years ago

Thanks, Kilian. This all sounds great to me. Nitin, can you arrange payment?

desilinguist commented 5 years ago

@kilian-gebhardt Is it possible for you to provide me with all of the relevant information that you mentioned, so that I can fill out the form and put in the payment details as well?

kilian-gebhardt commented 5 years ago

@desilinguist I just sent you an email.

kilian-gebhardt commented 4 years ago

@desilinguist do you have any updates concerning this issue?

desilinguist commented 4 years ago

Sorry, I’m a bit swamped at work and with the upcoming ACL elections. This will have to wait until next week.

kilian-gebhardt commented 4 years ago

@mjpost and @desilinguist Alberto's proceedings volume is also ready. I updated the shipping/order instructions accordingly. Parcels from Italy to Romania are pricier (ca. 55 Euro), though.

mjpost commented 4 years ago

The prices are fine, I think, if that's what it takes to get this done.

desilinguist commented 4 years ago

I am going to work on this today.

desilinguist commented 4 years ago

Okay, I have set up the UPS shipments and paid the scanning company as well.

kilian-gebhardt commented 4 years ago

Short update: Both parcels arrived at the scanning company.

mjpost commented 4 years ago

Excellent!

kilian-gebhardt commented 4 years ago

I just received a download link with the IWPT scans. In general the quality is very high; there are a handful of papers that use some kind of italic font where OCR and contrast could be better – I will ask if they can rescan/reprocess those. A second issue is that there are some pencil marks, especially in the 1991 volume. Since I will ask them for an offer for the metadata creation anyway, I may also ask if they would try to remove the pencil marks and rescan those parts.

mjpost commented 4 years ago

Excellent! I haven’t looked yet so without seeing the marks I would suggest we shouldn’t worry about them too much. But if fixing them isn’t too costly it would be fine.

kilian-gebhardt commented 4 years ago

A somewhat delayed update (I had to finalize my thesis over the last weeks): I received a second link where the scans were done with a slightly different contrast/lightness calibration. Here too, the compressed versions of the PDFs have strange artifacts in a few instances, but the uncompressed ones are fine. I suppose we can use their compressed version where it is OK and otherwise recompress the remaining articles ourselves. I downloaded this archive, but maybe someone else can download a copy too.

Concerning the metadata creation, I received the response that overnight-scanning is booked up with projects until the end of the year. But they offered to get back to me in mid-December.

Concerning myself: I will be on an extended leave until the end of February (I had hoped that this project would be done by now). I can continue working on the project in March and figure out a solution with the scanning company then. If you want things to happen faster, I can also forward the contact with the scanning company to one of you. Or we could try to crowd-source the metadata creation.

mjpost commented 4 years ago

Thanks, Kilian. I did have a look and both copies seem good to me. However I think that we can wait until you return in March.

I'd still be in favor of having them create the metadata if the price is reasonable. Crowd-sourcing (perhaps from IWPT participants) could also be a good approach.

kilian-gebhardt commented 4 years ago

@mjpost Hey, I'm back. I just contacted the scanning company. They could create an Excel spreadsheet with the information we need. Per article they would charge 9.50 Euro (with abstracts) or 6 Euro (without). Prices are without VAT. In total the 5 proceedings volumes contain ca. 160 articles.

This seems quite high to me given that the information can be aggregated and checked in a few minutes per paper.

mjpost commented 4 years ago

Welcome back! I agree this is too much. Let's seed a spreadsheet with the first paper and then crowdsource it? I bet if we emailed the IWPT mailing list, people might contribute a conference. We should request that authors be written in BibTeX format ("last, first and last, first ...") so that we can parse them appropriately. If you want to create the spreadsheet, I can help seed it later this week, or I can create it later this week if you're busy.
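
For illustration, here is a throwaway Python sketch (not Anthology code, just what I have in mind) of how such "last, first and last, first" strings could later be parsed:

```python
# Throwaway sketch: split a BibTeX-style author field such as
# "Doe, Jane and Smith, John" into (first, last) pairs.
def parse_authors(field: str) -> list[tuple[str, str]]:
    authors = []
    for name in field.split(" and "):
        last, _, first = (part.strip() for part in name.partition(","))
        authors.append((first, last))
    return authors

print(parse_authors("Doe, Jane and Smith, John"))
# [('Jane', 'Doe'), ('John', 'Smith')]
```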

davidweichiang commented 4 years ago

Do the scans have OCR in them?

ETA: Yes, and it seems pretty easy to copy and paste. I don't think this will take that much work at all.

davidweichiang commented 4 years ago

If so, maybe Grobid could give a good rough cut.

ETA: I don't think it's worth it, though it's been on my mind for a while to train Grobid on our data to automate correction of metadata for new ingestions.

kilian-gebhardt commented 4 years ago

Ok, I prepared spreadsheets for each of the volumes and seeded each with the first entry (sometimes more). I included a "pages in pdf" column, which will make it easier for us to automate the extraction of the PDFs for the individual papers from the book PDF. The files are currently read-only.

https://drive.google.com/drive/folders/1CLaPIbdRVrfXx7l285ieuPBbWvDXenUE?usp=sharing
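
To illustrate why the "pages in pdf" column is useful (just a sketch, not decided tooling; the file name and page numbers are made up), extracting an individual paper from a book PDF could later look roughly like this with the pypdf library:

```python
# Sketch: cut one paper out of a scanned book PDF using its 1-based page range.
from pypdf import PdfReader, PdfWriter

def extract_paper(book_pdf: str, first_page: int, last_page: int, out_pdf: str) -> None:
    reader = PdfReader(book_pdf)
    writer = PdfWriter()
    for page in reader.pages[first_page - 1:last_page]:  # convert to 0-based slice
        writer.add_page(page)
    with open(out_pdf, "wb") as f:
        writer.write(f)

# e.g. extract_paper("iwpt_1991.pdf", 23, 37, "iwpt_1991_paper_03.pdf")
```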

davidweichiang commented 4 years ago

Let's use BibTeX conventions: separate three or more authors with "and", use curly braces to protect title letters that should not be lowercased.

Let's try to true-case any words written in all caps.

kilian-gebhardt commented 4 years ago

@davidweichiang What about two authors? Aren't they also separated with "and"? Concerning true-casing and curly braces for protection, I hope I updated the files appropriately. I'm not completely sure about the standard we use (or where it is documented) – but we should certainly explain this to the volunteers.

I publicly hosted the _OCR_Optimized versions of the scans at: https://wwwtcs.inf.tu-dresden.de/~kilian/assets/IWPT/iwpt_{1991,1993,1995,1997,2000}.pdf.

davidweichiang commented 4 years ago

Yes, two authors too. For titles, we just protect the letters which begin proper names etc., and must never be lowercased. For example, A Principle-Based Parser for Foreign Language Training in {G}erman and {A}rabic.

kilian-gebhardt commented 4 years ago

Ok. I propose that I remove the write protection from the spreadsheets. Then everyone can contribute to the tables, and contributions of just 5 or 10 articles are also possible. Live editing should prevent double assignment without the need for explicit moderation. I hope that this makes it easier to find volunteers.

Occasionally I will add write protection to lines that are finished. Google Docs has version history, so we do not need to fear accidental deletions or tampering.

mjpost commented 4 years ago

Sounds great. Some thoughts:

I would suggest asking for a single volunteer per year to simplify things. It really should only take about a half hour to go through them.

kilian-gebhardt commented 4 years ago

I agree to suggestions 1 and 2.

> maybe separate out the page numbers into start and end, else you may get people using assortments of dash, ndash, and dash-dash (though we could probably handle this)

My thought was that we can handle this (see the sketch at the end of this comment). But I can change it.

> are there ever separate volumes in IWPT (for example, for student papers, or non-peer-reviewed papers, etc?) we might want to add a "volume" field to handle that, if so

1997 and 2000 have separate sections in the proceedings for invited talks, papers, and posters. But it's just one proceedings volume.

> I would suggest asking for a single volunteer per year to simplify things. It really should only take about a half hour to go through them.

I think that it will take considerably longer: there are between 27 and 45 entries in each volume. At least the authors will require manual processing (BibTeX-style first/last name separation, adding of 'and'). There might be a need to true-case. At times the OCR is not that good, i.e., there is superfluous whitespace, also in between letters of the same word or before punctuation. In particular, checking the abstracts takes some time. On average I would estimate 5 minutes per paper.
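
To make concrete what could be automated and what stays manual, a rough sketch (my own, not agreed tooling): page ranges written with "-", "–", or "--" and stray spaces before punctuation can be normalised automatically, whereas spaces inside words will still need a manual pass.

```python
# Rough clean-up sketch for the spreadsheet data (not agreed tooling).
import re

def split_pages(pages: str) -> tuple[str, str]:
    # "123-135", "123 – 135", "123--135" -> ("123", "135")
    start, end = re.split(r"\s*(?:--|–|-)\s*", pages.strip(), maxsplit=1)
    return start, end

def clean_ocr_spacing(text: str) -> str:
    text = re.sub(r"\s+([,.;:!?)])", r"\1", text)  # drop space before punctuation
    return re.sub(r"\s{2,}", " ", text)            # collapse runs of whitespace

print(split_pages("123 -- 135"))                   # ('123', '135')
print(clean_ocr_spacing("parsing , as in ( 3 )"))  # 'parsing, as in ( 3)'
```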

davidweichiang commented 4 years ago

I think it'll be somewhere between @mjpost's and @kilian-gebhardt's estimates.

That means the remaining changes will mostly consist of inserting or deleting spacing/punctuation, with minimal retyping of words.

davidweichiang commented 4 years ago

Sign up:

danielgildea commented 4 years ago

I just did 1997.

davidweichiang commented 4 years ago

Some notes:

kilian-gebhardt commented 4 years ago

I did 1993. It has an appendix with one paper which reuses page numbers from 1 to 17. How shall we deal with this?

mjpost commented 4 years ago

Great. In the TOC, the page number is given as 349, and then it starts on p. 351. Let's use that.

kilian-gebhardt commented 4 years ago

I wrote to the SIGPARSE mailing list and asked for volunteers for the remaining two volumes.

@davidweichiang I started working on a script that processes the spreadsheets. Can you point me to existing code in the project that handles LaTeX processing and true-casing?
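
Something like the following rough sketch is what I have in mind for the true-casing/brace-protection step (the PROTECTED set is just a made-up placeholder; I'd much rather reuse existing project code if there is any):

```python
# Rough sketch only – not the Anthology's actual tooling.
# PROTECTED is a placeholder set of proper names/acronyms whose first letter
# gets the {X} brace protection discussed above.
PROTECTED = {"German", "Arabic", "English", "HPSG"}

def rough_truecase(title: str) -> str:
    out = []
    for word in title.split():
        protected = next((p for p in PROTECTED if p.lower() == word.lower()), None)
        if protected:
            out.append("{" + protected[0] + "}" + protected[1:])
        elif word.isupper() and len(word) > 1:
            out.append(word.capitalize())  # crude: function words etc. still need a manual pass
        else:
            out.append(word)
    return " ".join(out)

print(rough_truecase("A PRINCIPLE-BASED PARSER FOR GERMAN AND ARABIC"))
# -> 'A Principle-based Parser For {G}erman And {A}rabic' (still needs hand-correction)
```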

mjpost commented 4 years ago

I've recently written a similar script for ingesting the MT Archive that may be helpful: https://github.com/mardub1635/mt-archive/blob/master/scripts/ingest_tsv.py Let me know if anything isn't clear.