KennthShang / HostG

Graph convolutional neural network for host prediction

Currently necessary to run HostG from within the HostG directory? #2

Closed: mlhoggard closed this issue 2 years ago

mlhoggard commented 2 years ago

Hi there,

Thanks for your work on HostG. It looks like a great tool, and I'm keen to give it a test run with some current data we're working on.

I just wanted to check whether I'm missing something: from what I can tell, it is currently necessary to run the tool from within the HostG/ directory. Is that right?

Specifically, the current calls to other Python scripts, in the format cmd = "python run_CNN.py", appear to look for run_CNN.py only in the working directory, even if HostG/ is added to $PATH. Similarly, the path to the database (dataset/) appears to be hardcoded relative to the current directory (e.g. within run_KnowledgeGraph.py: pkl.load(open("dataset/phage2id.dict",'rb')) ).
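To illustrate the behaviour described above, here is a minimal sketch (not taken from the HostG code) showing that a relative path such as dataset/phage2id.dict is resolved against the current working directory, not against the location of the script. The helper name resolve_relative is made up for this example.

```python
import os
import tempfile


def resolve_relative(path):
    """Return the absolute path that open(path) would try to use right now."""
    # os.path.abspath joins a relative path onto os.getcwd().
    return os.path.abspath(path)


# The same relative path, resolved from the starting directory...
resolved_here = resolve_relative("dataset/phage2id.dict")

# ...and resolved again after changing to some unrelated directory.
original_cwd = os.getcwd()
with tempfile.TemporaryDirectory() as other_dir:
    os.chdir(other_dir)
    try:
        resolved_elsewhere = resolve_relative("dataset/phage2id.dict")
    finally:
        # Restore the working directory before the temp dir is removed.
        os.chdir(original_cwd)

print(resolved_here)
print(resolved_elsewhere)
```

The two printed paths differ, which is why the hardcoded dataset/ path is only found when the working directory is HostG/.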

Thanks again for all your work on this, and I'm looking forward to seeing how the outputs from our data look.

Kind regards, Mike.

(P.S. I installed simply by cloning the repo rather than via Anaconda, but perhaps it was written on the assumption that it would only be run from directly within the conda environment?)

KennthShang commented 2 years ago

Ummm. Yes, currently the CNN model can only be run in the given folder because the path of the CNN script is hardcoded.

If you are using an HPC, you can sbatch your job under the HostG folder. That will work.
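If you launch jobs from a Python wrapper rather than sbatch, the same workaround can be sketched roughly as below: run the command with the HostG folder as its working directory. This is only a sketch; run_from_directory is a made-up helper, and the command and directory you pass are up to you. Note that outputs will still be written inside that folder.

```python
import subprocess
import sys
from pathlib import Path


def run_from_directory(command, directory):
    """Run `command` with `directory` as its working directory.

    The caller's own working directory is left unchanged.
    """
    return subprocess.run(
        command,
        cwd=Path(directory).resolve(),  # relative paths in the child resolve here
        check=True,                     # raise CalledProcessError on non-zero exit
        capture_output=True,
        text=True,
    )


# Demonstration with a harmless command that reports its working directory.
result = run_from_directory(
    [sys.executable, "-c", "import os; print(os.getcwd())"], "."
)
print(result.stdout.strip())
```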

Best, Jiayu

mlhoggard commented 2 years ago

Hi Jiayu,

Thanks for the quick reply.

OK, I'll continue running from within the HostG/ directory. If it's possible to change this in a future update, that would be great, as the current behaviour is a bit problematic on a shared system: only one dataset can be run at a time, and the program directory itself gets jumbled with all of the output files.

I had a quick look at whether I could patch this for our system, but there are a number of subscripts with the dataset/ directory and all output directories and paths hardcoded, so it got a bit tricky for me, not being fully familiar with how all of the scripts interrelate. If of interest, some possible modifications could include:

  1. Defining a variable holding the path to the HostG scripts, derived from the run_Speed_up.py path, to prepend to all calls to other scripts. E.g. possibly via something like: HostG_path = str(os.path.dirname(os.path.abspath(sys.argv[0]))), followed by subsequent calls in the format cmd = str(HostG_path)+"/run_CNN.py" (n.b. python call dropped assuming note 3 below).
  2. The variable above could also be passed as an argument to the subscripts to update the currently hardcoded paths for all model files (e.g. the CNN model) and the dataset/ directory path. (e.g. from what I could tell, dataset/ is currently hardcoded in run_CNN.py, run_KnowledgeGraph.py, run_phage_host.py, and run_phage_phage.py.)
  3. Adding the header line #!/usr/bin/env python3 to the top of each of the Python scripts (and making them executable) would also enable all calls within the scripts to omit the direct call to python (e.g. updated to the format cmd = "run_CNN.py"). This would also enable searching $PATH for the scripts rather than only the working directory (although searching $PATH is less necessary for the subscripts if note 1 above were implemented, it would still be very useful for run_Speed_up.py).
  4. Similarly, it would be fantastic if an additional argument could be included to provide the path to the dataset/ directory, with a default based on the full path to run_Speed_up.py (i.e. an additional argument added to run_Speed_up.py for the database, the path of which is then passed to whichever other scripts require it). This would allow running the program from somewhere other than the HostG/ directory, and would also allow storing the database in a separate directory from the program (as might be preferable in some instances).
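To make notes 1, 2 and 4 a bit more concrete, here is a rough sketch of how they might fit together. It is not a patch against the actual HostG code: the --database option, and the helper names script_dir, build_parser, subscript_command and main, are all hypothetical, and the subscripts would need to be updated to accept the database path before this would work.

```python
import argparse
import os
import sys


def script_dir(entry_point):
    """Directory containing the entry script (note 1), e.g. from sys.argv[0]."""
    return os.path.dirname(os.path.abspath(entry_point))


def build_parser(default_database):
    """Parser with a hypothetical --database option (note 4)."""
    parser = argparse.ArgumentParser(
        description="Sketch of path handling for a HostG-style wrapper"
    )
    parser.add_argument(
        "--database",
        default=default_database,
        help="path to the dataset/ directory (default: next to the entry script)",
    )
    return parser


def subscript_command(hostg_path, script_name, database):
    """Build the call to a subscript using its full path (notes 1 and 2).

    The --database flag passed on here is an assumption; the existing
    subscripts do not currently accept it.
    """
    return [
        sys.executable,                          # same interpreter as the caller
        os.path.join(hostg_path, script_name),   # full path, not cwd-relative
        "--database",
        database,
    ]


def main(argv, entry_point):
    """Resolve paths and return the command that would run run_CNN.py."""
    hostg_path = script_dir(entry_point)
    args = build_parser(os.path.join(hostg_path, "dataset")).parse_args(argv)
    return subscript_command(hostg_path, "run_CNN.py", args.database)
```

In a real entry script this might be called as main(sys.argv[1:], sys.argv[0]), with the returned list handed to subprocess.run. Building the command as a list with sys.executable avoids relying on the shebang line from note 3 for the subscripts, though the shebang would still be useful for calling run_Speed_up.py via $PATH.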

Thanks again for the reply. I had a quick follow-up question regarding outputting the threshold scores, but I will open a new thread for that one.

Kind regards, Mike.