allenai / specter

SPECTER: Document-level Representation Learning using Citation-informed Transformers
Apache License 2.0

Streaming data for inference #11

Closed ckald closed 3 years ago

ckald commented 4 years ago

Hi! I'm trying to embed some 100M papers using SPECTER. However, there's some kind of a memory leak that makes the whole process extremely inefficient. I see that AllenNLP models support JSONL input format.

What is the simplest way to replace the ids and metadata args with a single JSONL file or stdin?

armancohan commented 4 years ago

For prediction, the code should already support JSONL input in a streaming way. See line 98 in the predict command and allennlp's _get_json_data method. Our code currently reads the data twice: first it counts the number of lines in the input at this line, then it reads the input again for prediction. If you want to make it fully streamable, remove that line and remove total_size from tqdm at lines 92 and 97 accordingly.
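As a rough illustration of the single-pass idea (not the actual SPECTER or AllenNLP code), here is a minimal standard-library sketch that reads JSONL lazily and groups records into batches without ever counting lines up front. The helper names `iter_jsonl` and `batched`, and the `paper_id`/`title` fields, are hypothetical and chosen only for this example.

```python
import io
import json
from itertools import islice
from typing import Iterable, Iterator, List, TextIO


def iter_jsonl(stream: TextIO) -> Iterator[dict]:
    """Yield one parsed record per non-empty line, without loading the whole input."""
    for line in stream:
        line = line.strip()
        if line:
            yield json.loads(line)


def batched(records: Iterable[dict], batch_size: int) -> Iterator[List[dict]]:
    """Group a lazy record stream into lists of at most batch_size items."""
    iterator = iter(records)
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            return
        yield batch


# Hypothetical input: three papers, one JSON object per line (blank line is skipped).
sample = io.StringIO(
    '{"paper_id": "a", "title": "First"}\n'
    '{"paper_id": "b", "title": "Second"}\n'
    '\n'
    '{"paper_id": "c", "title": "Third"}\n'
)

# Only one pass over the input; no total is needed at any point.
batch_sizes = [len(batch) for batch in batched(iter_jsonl(sample), batch_size=2)]
print(batch_sizes)
```

The same generators should work with `sys.stdin` in place of the `io.StringIO` sample. If a progress bar is wanted, wrapping the iterator in `tqdm` without a `total` argument shows a running count instead of a percentage, which avoids the extra line-counting pass.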

armancohan commented 3 years ago

Closing this for now. Feel free to reopen if you still have issues.