DennisSchmitz / Jovian_archive

Metagenomics/viromics pipeline that focuses on automation, user-friendliness and a clear audit trail. Jovian aims to empower classical biologists and wet-lab personnel to do metagenomics/viromics analyses themselves, without bioinformatics expertise.
GNU Affero General Public License v3.0

Count_mapped_reads rule is extremely slow #99

Closed: DennisSchmitz closed this issue 4 years ago

DennisSchmitz commented 4 years ago

The count_mapped_reads rule is extremely slow. It's currently a bash script wrapped in Snakemake... It runs on a single thread and processes all BAM files in an analysis sequentially, without any splitting/parallelization.

This should be parallelized per input BAM file via Snakemake, and we might use the parallel command to speed it up further.
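A rough sketch of that parallel route (input path, thread count, and samtools flags are assumptions, not the actual values used in Jovian; the per-sample output naming follows the tables that appear later in this thread, and headers are omitted for brevity):

# Hypothetical sketch: one counting job per BAM via GNU parallel instead of a
# sequential loop; {/.} is the input file name without path or extension.
# -F 0x904 (assumed) excludes unmapped, secondary and supplementary alignments.
parallel -j 4 \
  'samtools view -F 0x904 {} | cut -f 3 | sort | uniq -c > results/counts/Mapped_read_counts-{/.}.tsv' \
  ::: data/scaffolds_filtered/*.bam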

DennisSchmitz commented 4 years ago

@samnooij can you look at this bash script? It seems like values are computed and stored in variables, but they aren't actually printed or used? Am I misunderstanding something?

samnooij commented 4 years ago

Right, those variables don't do anything. They are more of a note to myself and the reader, showing how you can count different things with samtools. The actual 'work' is done by the printf line (which writes the header) and the final samtools view [blablabla] line. The comments in the script should be updated to reflect this.
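For concreteness, a minimal sketch of that shape (the exact filter flags and file names are assumptions, not the script's actual values):

# The printf writes the header; samtools view does the real counting work.
printf "mapped_reads\tscaffold_name\n"
# -F 0x904 excludes unmapped, secondary and supplementary alignments; counting
# occurrences of the reference name (column 3) gives reads per scaffold.
samtools view -F 0x904 sample.bam | cut -f 3 | sort | uniq -c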

Also, I didn't really consider speed when developing this; it was primarily the functionality that I had to get working. I think it took about half a minute per sample when I tested, which was fine. I can imagine it takes longer with more/bigger/more complex samples, which is undesirable and unnecessary. Indeed, parallelisation should be a good way to improve this. The easiest way to do that now would be bash parallelisation in the script (i.e. the parallel command, or forking processes to the background in the for-loop). Or would you prefer to have Snakemake handle everything (and probably add another rule to concatenate the separate count tables)?

I'll see if I can fetch some test data and the pipeline to fix it this week.

DennisSchmitz commented 4 years ago

Aah ok. Yeah, I tried it on the Cameroon dataset and there the rule took >1h, so that might be a good test dataset. I think the Snakemake method would be preferable, but it depends on how easy it is; I haven't had the time to look into it yet. But commenting out those variables and maybe including the -@ flag of samtools will already help a lot.
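For reference: -@ gives samtools additional worker threads (mainly faster BAM decompression). A minimal example, with an assumed thread count and filter flags:

samtools view -@ 4 -F 0x904 sample.bam | cut -f 3 | sort | uniq -c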

If you could look into it I'd be much obliged! Let me know if I can help.

samnooij commented 4 years ago

I have not tested it yet, but I proposed an update in https://github.com/DennisSchmitz/Jovian/commit/97e62a38cbd85058bf7732d8f1e7114aa081cb6a (mapped_reads branch). Can you test this easily? Otherwise I'll see if I can do it later.

DennisSchmitz commented 4 years ago

Tested it on a small demo set; after making the bash code compatible with Snakemake, it works. Tonight I'll try it with a larger dataset and afterwards merge it into dev.

Thanks!

DennisSchmitz commented 4 years ago

After trying it on a larger dataset (the public Cameroon one), it gives the following error in rule quantify_output:

Done counting!===============================================] 100.0% - Reading files  [ 189 / 189 ]
Traceback (most recent call last):
  File "bin/quantify_profiles.py", line 767, in <module>
    main()
  File "bin/quantify_profiles.py", line 653, in main
    classified_nrs = sum_superkingdoms(arguments.classified, arguments.mapped_reads)
  File "bin/quantify_profiles.py", line 318, in sum_superkingdoms
    [[ "mapped_reads" ]])
  File "/mnt/scratch_dir/schmitzd/20191019_Jovian_Cameroon_dataset_KronaLCA/.snakemake/conda/87d87e66/lib/python3.7/site-packages/pandas/core/frame.py", line 2682, in __getitem__
    return self._getitem_array(key)
  File "/mnt/scratch_dir/schmitzd/20191019_Jovian_Cameroon_dataset_KronaLCA/.snakemake/conda/87d87e66/lib/python3.7/site-packages/pandas/core/frame.py", line 2726, in _getitem_array
    indexer = self.loc._convert_to_indexer(key, axis=1)
  File "/mnt/scratch_dir/schmitzd/20191019_Jovian_Cameroon_dataset_KronaLCA/.snakemake/conda/87d87e66/lib/python3.7/site-packages/pandas/core/indexing.py", line 1327, in _convert_to_indexer
    .format(mask=objarr[mask]))
KeyError: "['mapped_reads'] not in index"

When I inspect the file results/counts/Mapped_read_counts.tsv, I see that the file is "weird" (see below). It looks like there is one block without a filename in the first column, while later lines do have this structure. @samnooij, is this file structure intended?

mapped_reads    scaffold_name
     59     NODE_1000_length_878_cov_0.407457
     55     NODE_1001_length_878_cov_0.259654
    788     NODE_1002_length_877_cov_6.033333
    140     NODE_1003_length_877_cov_1.030667
..........
..........
     74     NODE_999_length_878_cov_0.548602
    456     NODE_99_length_2032_cov_1.351706
   6354     NODE_9_length_5644_cov_7.284394
results/counts/Mapped_read_counts-SRR7892427.tsv:     58    NODE_1000_length_1051_cov_0.247835
results/counts/Mapped_read_counts-SRR7892427.tsv:    794    NODE_1001_length_1050_cov_4.755146
results/counts/Mapped_read_counts-SRR7892427.tsv:     52    NODE_1002_length_1050_cov_0.380282
..........
..........
DennisSchmitz commented 4 years ago

It's indeed the wonky file structure. If I make the file like below, it does work. I'll try to work this into the Snakefile next week.

# Concatenate the per-sample tables, skipping the repeated header lines of every file after the first:
awk 'FNR==1 && NR!=1 { while (/^mapped_reads/) getline; } 1 {print}' results/counts/Mapped_read_counts-*.tsv > results/counts/Mapped_read_counts.tsv

After making the file like that, no errors.

@samnooij, if I understand the code correctly, you match on the assumption that the scaffold name is unique? While the odds are good that they are all unique, that's not a certainty. Can the filename be added to force it to be unique?

samnooij commented 4 years ago

Indeed, I assume that there will be no scaffolds with the same number, length and coverage. It would be good to still include a check whether this is the case and make sure the code is robust even to unlikely events such as those. Even if that makes the script a bit more complicated. I am going to work on an update that does this.
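Sketched in shell for concreteness (the planned implementation is on the Python side; the column layout is taken from the table excerpt above):

# Hypothetical duplicate check: warn if a scaffold name (second column) recurs
# across the concatenated per-sample tables; FNR > 1 skips each file's header.
awk 'FNR > 1 && seen[$2]++ { print "Duplicate scaffold name: " $2 > "/dev/stderr" }' results/counts/Mapped_read_counts-*.tsv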

And as you guessed, the file is supposed to be a two-column, tab-separated table with the number of mapped reads and the scaffold name. I am going to add a third column with the sample/file name to exclude the possibility of duplicated scaffold names.
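A rough shell sketch of that three-column layout (file locations follow this thread; the eventual implementation will differ):

# Hypothetical sketch: write a three-column header, then concatenate the
# per-sample tables while appending the sample name (derived from each file
# name) as a tab-separated third column.
printf "mapped_reads\tscaffold_name\tsample_name\n" > results/counts/Mapped_read_counts.tsv
for f in results/counts/Mapped_read_counts-*.tsv; do
  sample=$(basename "$f" .tsv)          # e.g. Mapped_read_counts-SRR7892427
  sample=${sample#Mapped_read_counts-}  # e.g. SRR7892427
  awk -v s="$sample" 'BEGIN { OFS="\t" } FNR > 1 { print $1, $2, s }' "$f"
done >> results/counts/Mapped_read_counts.tsv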

DennisSchmitz commented 4 years ago

Perfect, thank you! Can you also include the above awk fix? I'm very tight on time, and since you're fixing the other bug anyway it would really help me if you also included that. If it's not possible, let me know :)

samnooij commented 4 years ago

The double curly braces for Snakemake compatibility? I'm thinking about changing the rule to a new Python script that does the duplicate checking together with the concatenation of all those separate tables. That seems more logical, and probably easier, to me than adding more lines of shell commands. So the awk line won't be needed anymore.

I will have a look at bin/quantify_profiles.py too, as this will then also have to include the file/sample names to prevent errors due to duplicated scaffold names.

DennisSchmitz commented 4 years ago

Perfect! Whatever works best for you :)

DennisSchmitz commented 4 years ago

I saw you pushed 2ea9819b and 1ede353a for this issue; is it in a testable state? I want to test and release the stable IGVjs fix, and then I can take this fix along as well.

samnooij commented 4 years ago

It almost is. Well, in fact I think it works already, yes. I am testing it too, with an additional update in quantify_profiles.py to merge tables on both scaffold_name and sample_name (to avoid errors with duplicated names). That part works now. However, I am getting problems with draw_heatmaps.py again, as in #42. I will push my little new update and then you can test it too. I am hoping to get a grip on the heatmaps error... :\

P.S. I think I'm closing in on the draw_heatmaps.py bug... At one point it makes a table of the number of read pairs per sample. In my example, the sample names end up like "SRR7892426_", with an underscore attached. The subsequent merger of dataframes by sample name fails because of this. Therefore, the fix is probably making the sample column resistant to _R1/1 suffixes by applying the same RegEx trick. (Getting a bit off-topic here...)

P.P.S. This issue seems to have been solved on the master branch, but looks defective on the dev branch! I am now going to try the code from the master branch and see if that solves my current testing issue.

samnooij commented 4 years ago

In https://github.com/DennisSchmitz/Jovian/commit/c7fe25e5d3d432b66e353e49efffcd3e3ce3c23e I took the draw_heatmaps.py script from the dev branch and changed def read_numbers(infile) to the same function as on the master branch. With that, I was able to run a complete analysis. In other words: it all works now. Please double-check that it works for you too. Hopefully we can finally get that bug out and speed up the counting. ;)

DennisSchmitz commented 4 years ago

I did a small test, it seems to work. I'll do a bigger test overnight and let you know tomorrow.

DennisSchmitz commented 4 years ago

It also works on larger datasets; the fix is included in the dev branch and will be in the next release.