aqlaboratory / rgn2

Batch mode? #4

Closed by HWaymentSteele 1 year ago

HWaymentSteele commented 1 year ago

Hi, are there any plans to make RGN2 available to run sequences in batch mode, either in a notebook or as downloadable source code? It would be helpful for benchmarking. Thank you!

christinaflo commented 1 year ago

Hi, you can already use the source code to batch aminoBERT and RGN2 predictions. To do this, create a directory containing your input FASTA files and call the parse_fastas method followed by the aminobert_predict method in rgn2/aminobert/prediction.py, instead of aminobert_predict_sequence, which the notebook uses. The remainder of the workflow is the same.
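For anyone preparing the input directory, here is a minimal sketch for checking which sequences a folder of FASTA files contains before handing it to the batch methods. It does not call or reproduce rgn2's parse_fastas, whose signature is not given in this thread; `read_fasta_dir` is a hypothetical helper, and the `.fasta`/`.fa` extensions are an assumption.

```python
from pathlib import Path


def read_fasta_dir(fasta_dir):
    """Collect (header, sequence) pairs from every .fasta/.fa file in fasta_dir.

    Hypothetical stand-in for inspecting inputs; not part of rgn2.
    """
    records = []
    for path in sorted(Path(fasta_dir).iterdir()):
        # Assumption: input files use a .fasta or .fa extension.
        if path.suffix.lower() not in {".fasta", ".fa"}:
            continue
        header, chunks = None, []
        for line in path.read_text().splitlines():
            line = line.strip()
            if not line:
                continue
            if line.startswith(">"):
                # A new header closes the previous record, if any.
                if header is not None:
                    records.append((header, "".join(chunks)))
                header, chunks = line[1:], []
            else:
                # Sequences may be wrapped over several lines.
                chunks.append(line)
        if header is not None:
            records.append((header, "".join(chunks)))
    return records
```

Calling `read_fasta_dir("my_fastas")` returns a list of `(header, sequence)` tuples, which makes it easy to spot empty or duplicated entries before running the real pipeline.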

HWaymentSteele commented 1 year ago

Thank you!

I am trying to set this up on my own. I'm a little confused by this cell, which runs protling.py -- what input from the earlier steps does it need to read?

```python
#@title Run RGN2
#This step generates the raw RGN2-predicted C-alpha trace.

rgn2_env_init = 'source /opt/conda/etc/profile.d/conda.sh && conda init && conda activate rgn2'
try:
  with io.capture_output() as captured:
    cmd = (f"python rgn/protling.py {os.path.join(RUN_DIR, 'configuration')} "
           f"-p -e 'weighted_testing' -a -g 0")
    %shell {rgn2_env_init} && {cmd}
except subprocess.CalledProcessError:
  print(captured)
  raise

print('Prediction completed!')
```

christinaflo commented 1 year ago

Sorry, I must have missed this. I assume you have resolved it by now, but the only input needed is the path to the configuration file within the run directory. The aminoBERT step in the notebook creates the input TFRecord dataset for your sequences and puts it in a data directory for RGN2, so there is no need to specify it in the run command.
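Outside the notebook, the same call can be made with plain `subprocess` instead of the `%shell` magic. This is only a sketch: it assumes the script is run from the repository root (so `rgn/protling.py` resolves) with the rgn2 conda environment already activated. The flags are copied verbatim from the notebook cell above, and `build_rgn2_command` and `run_rgn2` are hypothetical helper names.

```python
import os
import subprocess


def build_rgn2_command(run_dir):
    """Return the argument list for the notebook's protling.py call."""
    # The only input is the configuration file inside the run directory.
    config_path = os.path.join(run_dir, "configuration")
    # Flags copied unchanged from the notebook cell.
    return ["python", "rgn/protling.py", config_path,
            "-p", "-e", "weighted_testing", "-a", "-g", "0"]


def run_rgn2(run_dir):
    """Run the prediction; print captured output only if the command fails."""
    result = subprocess.run(build_rgn2_command(run_dir),
                            capture_output=True, text=True)
    if result.returncode != 0:
        print(result.stdout)
        print(result.stderr)
        result.check_returncode()  # raises subprocess.CalledProcessError
    return result
```

Looping `run_rgn2` over several run directories would give a simple batch driver, provided the aminoBERT step has already written the TFRecord data for each one.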