EleutherAI / the-pile

MIT License

PDF parsing #71

Closed leogao2 closed 3 years ago

leogao2 commented 3 years ago

Existing PDF parsing solutions are often not of high enough quality. This is a wishlist of things we would want in a PDF parser.

bratao commented 3 years ago

I do have access to a commercial OCR that does check all those boxes. I cannot share the code, but if you provide the PDFs I can generate the .html or .txt for each file.

hippke commented 3 years ago

I will give it a try. Using PDFMiner.six I convert a random sci-hub PDF to text and obtain for its first page the text appended below. Issues:

In general I believe that a useful fraction is convertible (>10%). Perhaps we "only" need an automatic way to determine whether the output is fine or garbage? What is the expected minimum quality level?

`PHYTOTHERAPY RESEARCH Phytother. Res. 13, 655–659 (1999)

Effects of Mistletoe Lectin I and Ionizing Radiation on the Glucose and Thymidine Uptake in Tumour Cells in vitro

Tamara Kubasova,1* Ileana Petcu,2 U. Pfu¨ller3 and G. J. Ko¨teles1 1Fre´de´ric Joliot-Curie National Research Institute for Radiobiology and Radiohygiene, Budapest, P.O. Box 101, H-1775 Hungary 2Horia Hulubei Institute of Physics and Nuclear Engineering, P.O. Box MG-6, R-76900 Bucharest, Romania 3Institute of Phytochemistry, University of Witten/Herdecke, D-58448 Witten, Germany

The increased uptake of hexose by mammalian cells is considered to be a general response to stress. Nowadays, mistletoe lectin separated from the extracts of the European mistletoe (Viscum album L.) is often used in adjuvant cancer therapy. The present work studies the effect of the lectin on unirradiated and x-irradiated tumour cells. The response of cultured human lung carcinoma cells (Calu-1) was fol- lowed by radioactive glucose uptake as well as by tritiated thymidine incorporation. The cells were main- tained either in a complete or a so-called restrictive medium.

Slight metabolic changes were found in the restrictive medium but not in the complete one. Mistletoe lectin I at a very low concentration (0.001 ng/mL) increased the glucose uptake and thymidine incorporation. Ionizing radiation (1 Gy) did not influence the hexose uptake but it enhanced the incorporation of thymidine. It seems that the actions of two different factors (mistletoe lectin I and radiation) proved to be rather provoking stress effects for the tumour cells as detected in the restrictive medium. Copyright # 1999 John Wiley & Sons, Ltd.

Keywords: Calu-1; mistletoe lectin I; ionizing radiation; thymidine; D-glucose; metabolic response.

INTRODUCTION

The hexose uptake of mammalian cells is known to change under certain stress circumstances (Gray et al., 1983; Weber et al., 1984; Warren et al., 1986; Pasternak et al., 1991). This increased glucose uptake upon environmental stress can be considered as a general response of cells through changes in plasma membrane function. Alteration of the physiological conditions of membranes have also been shown in our earlier experiments in vitro on different cell cultures and blood cells exposed to ionizing radiation at relatively low doses (0.25–2.5 Gy), as detected by the binding of radiolabelled concanavalin A lectin to the cell surfaces (Ko¨teles et al., 1976; Kubasova et al., 1981a, 1981b, 1984). the use of different

treatments (cytostatic drugs, radiation, adjuvant preparations) in cancer therapy can lead to the alteration of plasma membrane function and metabolic processes in both malignant and normal cells. The favourable effects of extracts from the European mistletoe Viscum album L. have been known for over 70 years for the treatment of inflammatory diseases and also cancer hypertonia, (Kwaja et al., 1986; Hajto et al., 1989; Franz, 1991; Kuttan, 1993; Gabius et al., 1994). The effect of the extracts is attributed to their main constituent, lectin.

is evident

that

It

The aim of the present work was to study the metabolic changes in cultured tumour cells (human lung carcinoma line Calu-1) on the effect of mistletoe lectin I (ML I) used widely in cancer adjuvant therapy. Uptake of 3H-glucose by the cells and incorporation of 3H-thymidine into them were used to reflect the metabolic changes in Calu-1 cell cultures. This experimental approach was intended to reveal whether ML I treatment at very low lectin concentrations (0.001 ng/mL) produces any modification in the response of x-irradiated cells.

MATERIALS AND METHODS

Human lung carcinoma cell line Calu-1. This was a gift of the Memorial Sloan-Kettering Cancer Center, since 1986 the cells have been adapted to RPMI-1640 medium supplemented with 10% fetal calf serum (FCS), L- glutamine and antibiotics (complete medium). The cells were grown on tissue culture plates of 24 wells (Greiner, Germany). In separate experiments, 1 (cid:2) 105–1.4 (cid:2) 105 cells/mL in a well were used for plating. All cells were incubated in the complete medium for 4 h; then, for one part of the cells (used in the deoxy-D-glucose uptake assay), the medium was replaced by the restrictive medium containing 0.5 % FCS only and the incubation was continued for 3 h. Half of the cultures, in both the complete and the restrictive media, were irradiated with x-rays. For the irradiation period the medium was changed to a serum-free one. Starting immediately after the radiation exposure the irradiated and unirradiated

CCC 0951–418X/99/080655–05 $17.50 Copyright # 1999 John Wiley & Sons, Ltd.

Received 25 November 1998 Accepted 28 January 1999

656
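
On the question of automatically telling fine output from garbage: a minimal sketch, assuming that the artifacts visible in the sample above (unmapped `(cid:N)` glyphs, hyphens left over from line breaks such as "fol- lowed", and detached accents such as "¨") are reasonable proxies for quality. The pattern names, the thresholds, and the `looks_usable` function are all hypothetical and untuned, not an established metric.

```python
import re

# Hypothetical artifact patterns taken from the PDFMiner.six sample above.
PATTERNS = {
    "cid_marker": re.compile(r"\(cid:\d+\)"),         # unmapped glyphs, e.g. "(cid:2)"
    "broken_hyphen": re.compile(r"[a-z]- [a-z]"),     # line-break hyphens, e.g. "fol- lowed"
    "detached_accent": re.compile("[\u00a8\u00b4]"),  # stray diaeresis or acute accent
}


def artifact_rates(text):
    """Return the number of matches per 1000 characters for each pattern."""
    n = max(len(text), 1)
    return {name: 1000 * len(p.findall(text)) / n for name, p in PATTERNS.items()}


def looks_usable(text, max_artifacts_per_1000=2.0, min_letter_ratio=0.6):
    """Crude pass/fail check; both thresholds are assumed, untuned values."""
    if not text.strip():
        return False
    # Share of characters that are letters or whitespace.
    letter_ratio = sum(c.isalpha() or c.isspace() for c in text) / len(text)
    total_rate = sum(artifact_rates(text).values())
    return letter_ratio >= min_letter_ratio and total_rate <= max_artifacts_per_1000


clean = "The increased uptake of hexose by mammalian cells is considered to be a general response to stress."
garbled = "1 (cid:2) 105 cells were fol- lowed by Pfu\u00a8ller (cid:2) (cid:2)"
print(looks_usable(clean), looks_usable(garbled))
```

Any real filter would need its thresholds tuned against a hand-labelled sample of good and bad conversions.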

BismarckBamfo commented 3 years ago

> I do have access to a commercial OCR that does check all those boxes. I cannot share the code, but if you provide the PDFs I can generate the .html or .txt for each file.

Try it on this pdf http://www.math.bas.bg/mathmod/Proceedings_CTF/CTF-1984/files_CTF-1984/CTF-1984-334-345.pdf

hippke commented 3 years ago

Attached is the result from ABBYY FineReader 15. Looks OK to me. I'd give it 90%. The document is pretty difficult (Russian math) and in bad shape (warps). Most of sci-hub will be much better. I estimate >90% of sci-hub will be 99% or better.

ABBYY has automation capability.

Is that "good enough"?

CTF-1984-334-345.docx CTF-1984-334-345-from-abbyy.txt

hippke commented 3 years ago

Also, companies like ABBYY and Omnipage have built these OCR solutions over decades, and likely put 100m++ USD into the research and dev. We won't improve over that on our own in the short term. It's either such a solution, or it's not good enough and can be tried again in a decade.

StellaAthena commented 3 years ago

> Attached is the result from ABBYY FineReader 15. Looks OK to me. I'd give it 90%. The document is pretty difficult (Russian math) and in bad shape (warps). Most of sci-hub will be much better. I estimate >90% of sci-hub will be 99% or better.
>
> ABBYY has automation capability.
>
> Is that "good enough"?
>
> CTF-1984-334-345.docx CTF-1984-334-345-from-abbyy.txt

Thanks for doing the conversion! I’ll take a look at it this weekend. Just an FYI, the text is in English not Russian. It’s from a Russian academic journal.

And yeah, we know it’s worse than most texts we will encounter. That’s what makes it a good test case :)

leogao2 commented 3 years ago

> I do have access to a commercial OCR that does check all those boxes. I cannot share the code, but if you provide the PDFs I can generate the .html or .txt for each file.

@bratao What volume of PDFs are you able to process? We may be processing a very large amount of PDFs. (Think 100 TB in total.)

hippke commented 3 years ago

Compute will be relevant. For the 7.3 MB test file my Core i7-7700k needs 15 s in FineReader, i.e. about 500 kB/s. For 100 TB of PDFs it would be 5 months.
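
A quick check of this arithmetic, as a sketch assuming decimal units (1 MB = 10^6 bytes, 1 TB = 10^12 bytes), that the test file is representative of the corpus, and that processing time scales linearly with data volume. `processing_days` is a hypothetical helper. Under these assumptions one such machine would need roughly 2,400 days, i.e. years rather than months, so the 5-month figure is worth rechecking.

```python
SECONDS_PER_DAY = 86_400


def processing_days(corpus_bytes, sample_bytes, sample_seconds, machines=1):
    """Days needed for the corpus, assuming linear scaling from one sample file."""
    throughput = sample_bytes / sample_seconds  # bytes per second on one machine
    return corpus_bytes / throughput / machines / SECONDS_PER_DAY


# Figures from the comment above: 7.3 MB in 15 s, 100 TB in total.
days = processing_days(corpus_bytes=100e12, sample_bytes=7.3e6, sample_seconds=15)
print(f"{days:,.0f} days (~{days / 365:.1f} years) on one machine")
```

Passing `machines=16` gives the estimate for sixteen identical machines under the same linear-scaling assumption.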

leogao2 commented 3 years ago

> Compute will be relevant. For the 7.3 MB test file my Core i7-7700k needs 15 s in FineReader, i.e. about 500 kB/s. For 100 TB of PDFs it would be 5 months.

That is not an issue. We're not in a hurry, and we can obtain much, much more compute than a single 7700k.

(At present we have 64 cores at our immediate disposal, which would bring that time down to well under a month, and we can obtain more if necessary.)

trisongz commented 3 years ago

I would recommend checking out JSL's PDF OCR. It scales well, and I have tested it myself on 4 GB of PDFs with good results. Getting started requires some tweaking and fine-tuning (memory, cores, etc.), but once all the settings are in place, it's fairly stable and reliable.

I went through several PDF readers including the ones listed above (notable mention to parsr) and ultimately went with JSL.

bratao commented 3 years ago

@leogao2 The OCR I have access to gives results very similar to or better than FineReader or OmniPage, so the output would be practically identical to @hippke's. I can also tweak it further to try to eliminate headers and footers.

It can process 30 pages per minute on a single VPS core. If I rent an EPYC dedicated server from Hetzner, I think it will process 100 TB in less than a month. I can do it.
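
To see how the pages-per-minute figure translates into hardware, a rough sketch. It assumes the 7.3 MB test file from earlier in the thread has 12 pages (a guess from the page range 334-345 in its filename), that this size per page is typical of the corpus, and that throughput scales linearly with cores. All of these are assumptions, and `cores_needed` is a hypothetical helper.

```python
import math

MINUTES_PER_DAY = 1_440


def cores_needed(corpus_bytes, bytes_per_page, pages_per_minute_per_core, deadline_days):
    """Cores required to finish within the deadline, assuming linear scaling."""
    total_pages = corpus_bytes / bytes_per_page
    core_minutes = total_pages / pages_per_minute_per_core
    return math.ceil(core_minutes / (deadline_days * MINUTES_PER_DAY))


# Assumed page size: 7.3 MB test file spread over an assumed 12 pages.
assumed_bytes_per_page = 7.3e6 / 12
cores = cores_needed(
    corpus_bytes=100e12,
    bytes_per_page=assumed_bytes_per_page,
    pages_per_minute_per_core=30,
    deadline_days=30,
)
print(f"{cores} cores for a 30-day deadline")
```

With these inputs the helper returns 127 cores. The result is very sensitive to the assumed size per page: a corpus with fewer bytes per page contains more pages and would need proportionally more cores.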