allenai / SciREX

Data/Code Repository for https://api.semanticscholar.org/CorpusID:218470122
Apache License 2.0
129 stars 30 forks

Question: How would I reformat a paper into the test.jsonl format #19

Open DarrinGlad opened 3 years ago

DarrinGlad commented 3 years ago

Hello, I was wondering whether any tools have already been implemented to format a paper into the test.jsonl format.

successar commented 3 years ago

Sorry, we don't have specific tools to convert data into the test.jsonl format (since it very much depends on what the initial source of the data is!). Note that if you only want to make predictions on a new document, the data needs just 4 fields:

{
  "doc_id" : str = Document ID as used by Semantic Scholar,
  "words" : List[str] = List of words in the document,
  "sentences" : List[Span] = Spans indexing into the words array that indicate sentences,
  "sections" : List[Span] = Spans indexing into the words array that indicate sections,
}

The remaining fields are needed when you want to use your own data to train the model.
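
As a rough illustration of the four fields above, here is a minimal sketch that builds one record from text that has already been split into sections, sentences, and words. The helper names (`build_record`, `to_jsonl_line`) and the example document ID are hypothetical and not part of SciREX, and the span convention used here (half-open `[start, end)` offsets into `words`) is an assumption that should be checked against the released data files.

```python
import json
from typing import Dict, List


def build_record(doc_id: str, sections: List[List[List[str]]]) -> Dict:
    """Build one prediction-only record from pre-tokenized text.

    `sections` is a list of sections; each section is a list of sentences;
    each sentence is a list of word strings.
    """
    words: List[str] = []
    sentence_spans: List[List[int]] = []
    section_spans: List[List[int]] = []

    for section in sections:
        section_start = len(words)
        for sentence in section:
            sentence_start = len(words)
            words.extend(sentence)
            # Assumption: spans are half-open [start, end) offsets into `words`.
            sentence_spans.append([sentence_start, len(words)])
        section_spans.append([section_start, len(words)])

    return {
        "doc_id": doc_id,
        "words": words,
        "sentences": sentence_spans,
        "sections": section_spans,
    }


def to_jsonl_line(record: Dict) -> str:
    """Serialize one record as a single JSON line (one document per line)."""
    return json.dumps(record)


# Hypothetical two-section document, used only to show the shape of the output.
example = build_record(
    "example-doc",
    [
        [["We", "propose", "X", "."]],
        [["It", "works", "."], ["Results", "follow", "."]],
    ],
)
line = to_jsonl_line(example)
```

Writing one such line per document to a file would give a test.jsonl-style input; how the text is tokenized and split into sections is left to whatever tool fits the source data.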

muguruzawang commented 3 years ago

But when I format my data the way you described, I get the error below:

Traceback (most recent call last):
  File "scirex/predictors/predict_ner.py", line 123, in <module>
    main()
  File "scirex/predictors/predict_ner.py", line 119, in main
    predict(archive_folder, test_file, output_file, cuda_device)
  File "scirex/predictors/predict_ner.py", line 35, in predict
    instances = dataset_reader.read(test_file)
  File "/home/jttang/.conda/envs/scirex_wpc/lib/python3.7/site-packages/allennlp/data/dataset_readers/dataset_reader.py", line 134, in read
    instances = [instance for instance in Tqdm.tqdm(instances)]
  File "/home/jttang/.conda/envs/scirex_wpc/lib/python3.7/site-packages/allennlp/data/dataset_readers/dataset_reader.py", line 134, in <listcomp>
    instances = [instance for instance in Tqdm.tqdm(instances)]
  File "/home/jttang/.conda/envs/scirex_wpc/lib/python3.7/site-packages/tqdm/_tqdm.py", line 1005, in __iter__
    for obj in iterable:
  File "/dat01/jttang/wpc/information_extraction/SciREX/SciREX/scirex/data/dataset_readers/scirex_full_reader.py", line 148, in _read
    json_dict = clean_json_dict(json_dict)
  File "/dat01/jttang/wpc/information_extraction/SciREX/SciREX/scirex/data/dataset_readers/scirex_full_reader.py", line 34, in clean_json_dict
    entities: List[Tuple[int, int, BaseEntityType]] = json_dict["ner"]
KeyError: 'ner'

It seems that scirex_full_reader.py reads all fields of the JSON file. How can I fix this?
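
One possible workaround, offered as an untested sketch rather than a fix confirmed by the maintainers, is to add empty placeholders for the annotation fields the reader looks up before running prediction. The traceback only confirms that `"ner"` is missing; whether an empty list is accepted downstream, and which other keys `clean_json_dict` expects, are assumptions to verify against scirex_full_reader.py. The helper names below are hypothetical.

```python
import copy
import json
from typing import Any, Dict, Iterable, List, Optional

# Only "ner" is confirmed missing by the KeyError in the traceback.
# Assumption: an empty list is an acceptable placeholder for it.
ASSUMED_EMPTY_FIELDS: Dict[str, Any] = {"ner": []}


def add_placeholder_fields(
    record: Dict[str, Any], defaults: Optional[Dict[str, Any]] = None
) -> Dict[str, Any]:
    """Return a copy of `record` with missing keys filled from `defaults`.

    Keys that already exist in `record` are left untouched.
    """
    defaults = ASSUMED_EMPTY_FIELDS if defaults is None else defaults
    patched = dict(record)
    for key, value in defaults.items():
        # deepcopy so records never share the same placeholder object
        patched.setdefault(key, copy.deepcopy(value))
    return patched


def patch_jsonl_lines(lines: Iterable[str]) -> List[str]:
    """Patch every non-empty JSON line and return the re-serialized lines."""
    return [
        json.dumps(add_placeholder_fields(json.loads(line)))
        for line in lines
        if line.strip()
    ]


# Example with a minimal four-field record.
minimal = {"doc_id": "example-doc", "words": ["A", "."], "sentences": [[0, 2]], "sections": [[0, 2]]}
patched = add_placeholder_fields(minimal)
```

If the reader then raises a `KeyError` for another field, that key could be added to the defaults passed to `add_placeholder_fields`, using whatever empty value matches the type the reader expects.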