dkoslicki / MetaPalette

Metagenomic profiling and phylogenetic distances via common kmers
Other
42 stars 5 forks source link

Code just not working #2

Closed jregalad closed 8 years ago

jregalad commented 8 years ago

Very good idea but implementation is not the best. Took me quite a while to get Training.py to run. Added some error handling to the subprocess.checkcall() commands. Also, code does not work with new versions of bcalm.

Main bug is with Classify.py:

On line 103 there is a `def form_y(file_name, kmer_size):

cmd = query_per_sequence_binary +" "+ os.path.join(output_folder,file_base_name+"-"+str(kmer_size)+"mers.jf") + " " + os.path.join(data_dir,"Bcalms",file_name+"-30mers.bcalm.fa")

res = subprocess.check_output(cmd, shell = True)

return int(res)`

Why are you returning int(res) when res stores the output of query_per_sequence which is a string (fasta)? https://github.com/gmarcais/Jellyfish/tree/master/examples/query_per_sequence

After this code fails because the array Y can not be created.

Hope we can sort this out, Julian

dkoslicki commented 8 years ago

Hi Julian,

I understand the implementation isn’t the best (time/skill/priority constraints), but that is one reason why the Docker container has the --shell command. If you’d like to play with the code pre-installed, use docker run -i -t dkoslicki/metapalette --shell.

There are a number of reasons why I chose not to use the current version of Bcalm (BCALM 2):

  1. It was not out when I implemented the Bcalm portions of the code, and I didn’t have the time to make it work with the new version.
  2. BCALM 2 uses a new, unpublished algorithm that I have not been able to check out yet.
  3. BCALM 2 requires the GATB-CORE which is a rather large package containing all sorts of third-party dependencies, other package requirements, etc. and I wanted to limit these requirements in MetaPalette.
  4. The CLI of BCALM 2 is quite different than BCALM 1, indicating that BCALM 2 is doing a lot more than finding all the simple paths (see point 2 above). BCALM 1 just finds the simple paths, and this is what I needed.

As for the query_per_sequence code, if you take a look at MetaPalette/src/QueryPerSeq/query_per_sequence.cc, you will see that the output is indeed an integer (not a string as in https://github.com/gmarcais/Jellyfish/tree/master/examples/query_per_sequence). I’ve modified the query_per_sequence.cc code to return just what I need (the counts). Hence why the readme says to compile this code (and doesn’t point to the Jellyfish repository).

As for the bugs you found, I would appreciate a pull request if you would like to contribute your fixes. Otherwise feel free to let me know what the other main issues with Train.py were and I can see if I can resolve them.

Thanks.