chrisquince / STRONG

Strain Resolution ON Graphs
MIT License
44 stars 9 forks source link

Take into account incomplete bins, to detect which cogs should be ignored. #76

Closed Sebastien-Raguideau closed 4 years ago

Sebastien-Raguideau commented 4 years ago

From issue #75

I created a cog_overlap branch, I implemented taking into account cogs from bins which are not mags : Now beware that all of the bins in the subgraph folders are not mags. Instead it is when I output the selected_bins.txt that I filter for mags. If a merged bin does contain at least one mag, it is selected. Additionally I had a look at the criterion for detecting that 2 bins share the same graph on 1 cogs. It is indeed 10% at the moment, but it is 10% of unitigs and not 10% of nucleotides. Shall I change that? You can now adjust this 10% threshold in the config file under : bayespaths:     percent_unitigs_shared: 0.1

I relaunched from scratch on SyntheticCommunity32/Seb_cog_overlap. I'll update if anything goes wrong, but it was working fine up to bayesgraph (didn't test yet take too much time).

chrisquince commented 4 years ago

This looks great Seb - I will try it now :)

chrisquince commented 4 years ago

OK that run failed for me, with error:


MissingOutputException in line 117 of /mnt/gpfs/chris/Projects/STRONG_Runs/Benchmarking/ResultsF/STRONG/COG_pipe/HeavyLifting.snake: Missing files after 120 seconds: subgraphs/bin_merged/selected_bins.txt

In case you want to look the run is here:

/mnt/gpfs/chris/Projects/STRONG_Runs/Benchmarking/ResultsF/Synth_G45_S10

It would be a lot easier to debug the relevant rule here:

rule select_bins_for_strain_analysis: input: select = merged_bins_cogs, mags = "binning/list_mags.tsv", merge_recipe = "subgraphs/bin_init/bins_to_merge.tsv" output: "subgraphs/bin_merged/selected_bins.txt" message: "Select bins containing sufficient number of COGs" run:

select bin_name only, not full path

  bins = [os.path.basename(os.path.dirname(cogs_fn)) for cogs_fn in input["select"] \
          if len(list(open(cogs_fn).readlines())) >= MIN_ORF_NUMBER_TO_RUN_A_BIN]
  # we have been considering indistinctly bins and mags previously, we will now only retain mags
  mags = ["Bins_%s" % line.rstrip() for line in open(input["mags"])]
  merged_recipe = {line.rstrip().split('\t')[0]:line.rstrip().split('\t')[1:] for line in open(input["merge_recipe"])}
  mags += [merged_bin for mergerd_bin,list_bins in merged_recipe.items() if (set(mags)&set(merged_recipe))!=set()]
  bins_selected=list(set(mags)&set(bins))
  with open(output[0], 'w') as out:
      out.write("\n".join(bins))

If it was a standalone python script?

Sebastien-Raguideau commented 4 years ago

I found the issue, it is again about how the input, output should be accessed.... So accessing output[0], is causing the issue. What works instead is to name the output and call it using it's name. Not sure what output[0] is interpreted as here. I mean, the result is written somewhere otherwise there would be an error message with opening the file. Not an error message about how we can't find the output file. Anyway. I corrected it and pushed it. I am also surprised I can't replicate this issue on my initial test data.... They should not allow accessing input and output with index if it does not have consistent outcome.

I got a bias against putting code in a python script if it means you'll add more line dealing with passing argument, than the initial code itself. Furthermore, I don't agree that debugging python code in snakemake is harder.

chrisquince commented 4 years ago

OK I will try the debugged version, thanks.

Can you step through snakemake in a debugger like ipdb?

Sebastien-Raguideau commented 4 years ago

no, and I usually do not debug using ipdb, maybe that's why I don't feel it's easier to debug python code in and out of snakemake :)

chrisquince commented 4 years ago

You really should step through code with a debugger, it is how to check that it is actually doing what you think it should be doing. It might be why we have so many bugs :(

This still does not run for me and gives the same error. This afternoon I will write a script to do this step and then debug it.

Sebastien-Raguideau commented 4 years ago

Chris, not using a debugger does not mean i don't debug. I usually use interactive python session to do so.

Ok, that is not cool. Can you send me the path to the STRONG repo you're using so I can try to replicate, also are you using an env? I don't think this is an issue with python code : it says it can't find the outputed file, it doesn't mean the file is empty or wrong, it just does not find it.

chrisquince commented 4 years ago

ipdb is an interactive debug session. My point was that you cannot step through snakemake code making it hard to debug. The pipeline ran this time. Which is odd. I will test it again but I want to understand what you are doing here. Are you available to skype?

Sebastien-Raguideau commented 4 years ago

I just had a look, i did some manual debugging, using a print on the first line of python code. The python code was not executed in snakemake 5.4.5 but was executed in 5.6.0. On the other hand by putting this snipset of code in a script, it was executed in 5.4.5.
It is likely that the output[0] had nothing to do with the issue, I just activated my env between 2 tests. I'm on skype.

chrisquince commented 4 years ago

It looks like the MAG selection part is not working, all bins are getting copied from bin_init to bin_merged not just MAGs!

Sebastien-Raguideau commented 4 years ago

It is not a bug, it's a feature. We did not finish the code explanation yesterday on skype, the selection is done after the merging process, when we output a file called selected_bins.txt. That is, by the way, what the python code I put in a script is doing.

chrisquince commented 4 years ago

Sure but then the BayesPaths snake and DESMAN need to be updated to take that file?

chrisquince commented 4 years ago

Also do we have all the correct behaviour here we should not merge bad bins into MAGs and we should flag cogs that post merging are still bad?

chrisquince commented 4 years ago

and the selected_bins.txt file is larger than the MAGs list so this is definitely a bug!

Sebastien-Raguideau commented 4 years ago

So, I agree on Desman, there is an issue, I'll fix it.

BayesPaths depends already on this file. Actually this is the control mechanism we implemented so that you could launch BayesPaths on whichever bins you wanted.

We reflag cogs postmerging, so there is no issue here.

Ok, we don't want to merge bad bins? It's not obvious to me. What does merging do, it will add additional components on non commons cogs, which should not hinder reconstruction of the mag. I mean it's like adding an additional strain or 2 on the graph? Only issue is that it will be incomplete or some cogs won't be from the same species, meaning you'll get fragmentary strains.

The difference in size comes from the fact one list bins numbers the over one list names such as Bin_1. No bug.

chrisquince commented 4 years ago

Exactly what format is used for the file: 'binning/list_mags.tsv' Is it different to the convention in? 'bayespaths/selected_bins.txt' Because they don't agree. If it is a convention thing the we should use the same for both. It is very confusing!

chrisquince commented 4 years ago

Regarding bad bins, I mean not MAGs, we definitely don't want to merge non-MAGs into MAGs as it will potentially turn our MAG into a bad bin?

Sebastien-Raguideau commented 4 years ago

One is using the concoct convention : a bin is a number, The other one is using folder names, which is helpful when we want to run bayespath.

Any merged bin will turn in a bad bin from the fact it will have at least 10 SCG in double. That's not a good reason. A good reason is that it may impede bayespath or create additional complexity in interpreting the output of bayespath, since some strains are going to be defined on only one cog, coming from no good bin. I'll have a look at changing that.

chrisquince commented 4 years ago

OK but if we add 'Bin_' to the binning/list_mags.tsv file we should get bayespaths/selected_bins.txt at least on my example run we did not. Now things could have changed with the update but I think I updated before it got to the point. So I will rerun but can you just check your test run?

I think you are misunderstanding my question. Only merge MAGs not all bins that is all I am saying.

Sebastien-Raguideau commented 4 years ago

Ok, sorry for the confusion, I see the same thing. I just found the issue and pushed.

Ok I'm confused now. At the moment a merged bin is selected only if there is at least one mag in it. However it is totally possible to have one merged bin with for instance one mag and multiple bad bins. I though you were saying that in this case, we don't want to have a merge event but just cogs flagged in the initial mag.

chrisquince commented 4 years ago

I don't think you have pushed the edited snake which actually uses this script :( Use git diff to check the state of your repo.