farmeh / ComplexTome_extraction

Relation extraction system and trigger word detection for ComplexTome dataset.
BSD 2-Clause "Simplified" License

Running pre-trained model on unlabeled data #1

Open serenalotreck opened 7 months ago

serenalotreck commented 7 months ago

I'd like to apply your pretrained model to a set of .txt files I have. I've cloned the repo and downloaded the model using the instructions in the LargeScaleRelationExtractionPipeline README.

However, it's unclear to me how I would go about running the model on unseen data. I tried just running the script with the following:

python run_ls_pipeline.py \
 --configs_file_path ComplexTome_configs.json \
 --log_file_path logs/drought.log \
 --model_folder_path the_best_model \
 --input_folder_path /mnt/scratch/lotrecks/drought_and_des_1000_subset_15Apr2024/ \
 --output_folder_path outputs/output_drought

where /mnt/scratch/lotrecks/drought_and_des_1000_subset_15Apr2024/ is a folder of .txt files. However, I'm getting a model path error that looks like it's because there's a hardcoded path somewhere, and I'm having trouble determining where it's coming from:

[2024-04-16 16:23:44] [<class 'helpers.logger.Logger'>.lp_halt]: 
      ********************************************************************************
      HALT REQUESTED BY FUNCTION: program_halt
      HALT MESSAGE: 
      Error loading the model. Error: Incorrect path_or_model_id: '/scratch/project_2001426/farmeh/bert_models/RoBERTa-large-PM-M3-Voc-hf/'. Please provide either the path to a local folder or the repo_id of a model on the Hub.
      HALTING PROGRAM!!!
      ********************************************************************************

[2024-04-16 16:23:44] [<class 'large_scale_prediction_pipeline_tf.Large_Scale_Prediction_Pipeline_tensorflow'>.exit]: 
      EXITING PROGRAM ... 

Haven't found a hardcoded path in any of large_scale_prediction_pipeline_tf.py, ComplexTome_configs.json, or run_ls_pipeline.py.
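
In case it matters, the next thing I was going to try is a recursive grep over the clone and the downloaded model folder (the_best_model is just the folder I passed above) to hunt for the path from the error message:

grep -rn "project_2001426" .
grep -rn "RoBERTa-large-PM-M3-Voc" the_best_model/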

Pointers appreciated!

farmeh commented 6 months ago

Hi, thanks for your question.

We have updated the codebase, so please update your git clone and follow the instructions in the updated README file; see here.

Basically, first, you need to run the get_model.sh script to download the original model and the best model weights into a folder on your cluster/GPU machine, and then provide that folder via the --model_folder_path argument.
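
As a rough sketch (the two /path/to/... folders are illustrative placeholders; the actual model folder is whatever get_model.sh downloads per the updated README):

bash get_model.sh

python run_ls_pipeline.py \
 --configs_file_path ComplexTome_configs.json \
 --log_file_path logs/drought.log \
 --model_folder_path /path/to/downloaded_model_folder \
 --input_folder_path /path/to/prepared_input/ \
 --output_folder_path outputs/output_drought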

Second, the large-scale relation extraction program does not include an NER component, i.e., it will not work on plain text. You need to run an NER system to detect Protein entities in the texts first; once you have done that, you can run the relation extraction system. (For an example, see here.)

Third, the --input_folder_path that you provide should have a specific format. Inside, there can be different sub-folders, each containing one or more .tar.gz files, and each archive contains documents in BRAT standoff format (each document consists of a .txt file with the text and a .ann file with the Protein entities inside). Check here.
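
For illustration only (all file names below are made up; only the structure matters), the input folder could look like:

input_folder/
    batch_01/
        docs_000.tar.gz   (contains PMID1.txt, PMID1.ann, PMID2.txt, PMID2.ann, ...)

Each .ann file holds tab-separated BRAT standoff entity lines whose character offsets point into the matching .txt, e.g. for a PMID1.txt containing "BRCA1 interacts directly with BARD1.":

T1	Protein 0 5	BRCA1
T2	Protein 30 35	BARD1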

If you follow the aforementioned steps and still see problems, please let us know. Thanks!