MrOlm / inStrain

Bioinformatics program inStrain
MIT License
137 stars 33 forks source link

inStrain compare - output #164

Open marina-vn opened 11 months ago

marina-vn commented 11 months ago

I'm running the program "instrain Compare" (v. 1.8.0) and it only generates one file in the output directory (MG_mapped_only_iSp_COMPARE_comparisonsTable.tsv) and no figures.

The command that I'm running is:

inStrain compare -i BL090113_mapped_only_iSp/ BL090210_mapped_only_iSp/ BL090317_mapped_only_iSp/ BL090421_mapped_only_iSp/ BL090512_mapped_only_iSp/ BL090609_mapped_only_iSp/ BL090721_mapped_only_iSp/ BL090804_mapped_only_iSp/ BL090915_mapped_only_iSp/ BL091006_mapped_only_iSp/ BL091105_mapped_only_iSp/ BL091222_mapped_only_iSp/ BL100120_mapped_only_iSp/ BL100217_mapped_only_iSp/ BL100322_mapped_only_iSp/ BL100413_mapped_only_iSp/ BL100525_mapped_only_iSp/ BL100622_mapped_only_iSp/ BL100706_mapped_only_iSp/ BL100803_mapped_only_iSp/ BL100914_mapped_only_iSp/ BL101019_mapped_only_iSp/ BL101116_mapped_only_iSp/ BL101214_mapped_only_iSp/ BL110112_mapped_only_iSp/ BL110208_mapped_only_iSp/ BL110314_mapped_only_iSp/ BL110412_mapped_only_iSp/ BL110517_mapped_only_iSp/ BL110615_mapped_only_iSp/ BL110704_mapped_only_iSp/ BL110802_mapped_only_iSp/ BL110913_mapped_only_iSp/ BL111010_mapped_only_iSp/ BL111129_mapped_only_iSp/ BL111219_mapped_only_iSp/ BL120110_mapped_only_iSp/ BL120214_mapped_only_iSp/ BL120313_mapped_only_iSp/ BL120411_mapped_only_iSp/ BL120518_mapped_only_iSp/ BL120620_mapped_only_iSp/ BL120703_mapped_only_iSp/ BL120807_mapped_only_iSp/ BL120913_mapped_only_iSp/ BL121009_mapped_only_iSp/ BL121106_mapped_only_iSp/ BL121211_mapped_only_iSp/ BL130115_mapped_only_iSp/ BL130206_mapped_only_iSp/ BL130312_mapped_only_iSp/ BL130417_mapped_only_iSp/ BL130507_mapped_only_iSp/ BL130604_mapped_only_iSp/ BL130709_mapped_only_iSp/ BL130801_mapped_only_iSp/ BL130917_mapped_only_iSp/ BL131015_mapped_only_iSp/ BL131105_mapped_only_iSp/ BL131204_mapped_only_iSp/ BL140114_mapped_only_iSp/ BL140211_mapped_only_iSp/ BL140310_mapped_only_iSp/ BL140407_mapped_only_iSp/ BL140505_mapped_only_iSp/ BL140602_mapped_only_iSp/ BL140707_mapped_only_iSp/ BL140804_mapped_only_iSp/ BL140916_mapped_only_iSp/ BL141007_mapped_only_iSp/ BL141111_mapped_only_iSp/ BL141216_mapped_only_iSp/ BL150113_mapped_only_iSp/ BL150210_mapped_only_iSp/ BL150310_mapped_only_iSp/ BL150415_mapped_only_iSp/ BL150512_mapped_only_iSp/ BL150609_mapped_only_iSp/ BL150707_mapped_only_iSp/ BL150804_mapped_only_iSp/ BL150916_mapped_only_iSp/ BL151006_mapped_only_iSp/ BL151110_mapped_only_iSp/ BL151215_mapped_only_iSp/ -p 30 -d -o MG_mapped_only_iSp_COMPARE2

And the it seems to without any issue, but it doesn't generate the output that I need:


***************************************************
    ..:: inStrain compare Step 1. Load data ::..
***************************************************

Loading Profiles into RAM: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 84/84 [00:00<00:00, 219.56it/s]
1 of 1 scaffolds are in at least 2 samples
***************************************************
..:: inStrain compare Step 2. Run comparisons ::..
***************************************************

Running group 1 of 1
Comparing scaffolds: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1/1 [03:51<00:00, 231.64s/it]
***************************************************
..:: inStrain compare Step 3. Auxiliary processing ::..
***************************************************

***************************************************
..:: inStrain compare Step 4. Store results ::..
***************************************************

$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$

..:: inStrain compare finished ::..

Output tables........ /scr/marina_scr/MG20/bamreads/MG_mapped_only_iSp_COMPARE2/output/
Figures.............. /scr/marina_scr/MG20/bamreads/MG_mapped_only_iSp_COMPARE2/figures/
Logging.............. /scr/marina_scr/MG20/bamreads/MG_mapped_only_iSp_COMPARE2/log/

See documentation for output descriptions - https://instrain.readthedocs.io/en/latest/

$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$

..:: Overall ::..
InStrain version 1.8.0 started at 2023-10-31 15:51:19 and ended at 2023-10-31 15:55:13.
Runtime = 3 minutes, 54 seconds
Command = /home/emm4/anaconda3/envs/inStrain2/bin/inStrain compare -i BL090113_mapped_only_iSp/ BL090210_mapped_only_iSp/ BL090317_mapped_only_iSp/ BL090421_mapped_only_iSp/ BL090512_mapped_only_iSp/ BL090609_mapped_only_iSp/ BL090721_mapped_only_iSp/ BL090804_mapped_only_iSp/ BL090915_mapped_only_iSp/ BL091006_mapped_only_iSp/ BL091105_mapped_only_iSp/ BL091222_mapped_only_iSp/ BL100120_mapped_only_iSp/ BL100217_mapped_only_iSp/ BL100322_mapped_only_iSp/ BL100413_mapped_only_iSp/ BL100525_mapped_only_iSp/ BL100622_mapped_only_iSp/ BL100706_mapped_only_iSp/ BL100803_mapped_only_iSp/ BL100914_mapped_only_iSp/ BL101019_mapped_only_iSp/ BL101116_mapped_only_iSp/ BL101214_mapped_only_iSp/ BL110112_mapped_only_iSp/ BL110208_mapped_only_iSp/ BL110314_mapped_only_iSp/ BL110412_mapped_only_iSp/ BL110517_mapped_only_iSp/ BL110615_mapped_only_iSp/ BL110704_mapped_only_iSp/ BL110802_mapped_only_iSp/ BL110913_mapped_only_iSp/ BL111010_mapped_only_iSp/ BL111129_mapped_only_iSp/ BL111219_mapped_only_iSp/ BL120110_mapped_only_iSp/ BL120214_mapped_only_iSp/ BL120313_mapped_only_iSp/ BL120411_mapped_only_iSp/ BL120518_mapped_only_iSp/ BL120620_mapped_only_iSp/ BL120703_mapped_only_iSp/ BL120807_mapped_only_iSp/ BL120913_mapped_only_iSp/ BL121009_mapped_only_iSp/ BL121106_mapped_only_iSp/ BL121211_mapped_only_iSp/ BL130115_mapped_only_iSp/ BL130206_mapped_only_iSp/ BL130312_mapped_only_iSp/ BL130417_mapped_only_iSp/ BL130507_mapped_only_iSp/ BL130604_mapped_only_iSp/ BL130709_mapped_only_iSp/ BL130801_mapped_only_iSp/ BL130917_mapped_only_iSp/ BL131015_mapped_only_iSp/ BL131105_mapped_only_iSp/ BL131204_mapped_only_iSp/ BL140114_mapped_only_iSp/ BL140211_mapped_only_iSp/ BL140310_mapped_only_iSp/ BL140407_mapped_only_iSp/ BL140505_mapped_only_iSp/ BL140602_mapped_only_iSp/ BL140707_mapped_only_iSp/ BL140804_mapped_only_iSp/ BL140916_mapped_only_iSp/ BL141007_mapped_only_iSp/ BL141111_mapped_only_iSp/ BL141216_mapped_only_iSp/ BL150113_mapped_only_iSp/ BL150210_mapped_only_iSp/ BL150310_mapped_only_iSp/ BL150415_mapped_only_iSp/ BL150512_mapped_only_iSp/ BL150609_mapped_only_iSp/ BL150707_mapped_only_iSp/ BL150804_mapped_only_iSp/ BL150916_mapped_only_iSp/ BL151006_mapped_only_iSp/ BL151110_mapped_only_iSp/ BL151215_mapped_only_iSp/ -p 30 -o MG_mapped_only_iSp_COMPARE2 -d

..:: Checkpoints ::..

..:: Filter reads report ::..

..:: Profile report ::..

* Profiling splits *

* Merging splits and profiling genes *

..:: Geneome level report ::..

..:: Plotting ::..

..:: Compare ::..
CreateScaffoldComparisonObjects took <1 second       ( 0.4% of overall) RAM went from 162.02 MB to 163.21 MB (increased by 1.20 MB)
multiprocessing      took 3.0 minutes, 52.0 seconds (99.1% of overall)  RAM went from 163.21 MB to 305.41 MB (increased by 142.20 MB)
SaveResults          took <1 second       ( 0.4% of overall)    RAM went from 306.69 MB to 307.27 MB (increased by 600.00 KB)

Wall time                       3 minutes, 49 seconds
Total processes used            1
Average number processes used   1.0
Paralellization efficiency      100.0%
Units profiled                  1
Average time per unit           3.0 minutes, 49.0 seconds
Median time per unit            3.0 minutes, 49.0 seconds
Maximum unit time               3.0 minutes, 49.0 seconds
Longest running unit            vSAG-37-F6
Per-process efficiency          ['100.0']
unit per-process strating RAM           ['214.34 MB']
unit per-process final RAM              ['214.34 MB']
unit per-process minimum RAM            ['214.34 MB']
unit per-process maximum RAM            ['214.34 MB']

..:: Failures ::..
No failures

Any idea of what could be the issue and hot to fix it?

Best,

log.log

MrOlm commented 11 months ago

Hello @marina-vn - could you let me know what files you'd like to have and are missing? Also, are you mapping to a single genome or multiple genomes?

Thanks, Matt

marina-vn commented 11 months ago

Hi Matt, I'm mapping a temporal serie to a single genome and I was hopping to get some figure that shows a comparison of the samples based on SNVs, to see how the population changes month by month.

I'm looking into the "plot" module, but I'm not sure on how to use it with the "compare" results. It says that "Skipping plot 10 - you don't have all required information. You need to run inStrain genome_wide first" but in the manual it reads that if you have provided the gene file in the "profile" step is better, but thought I have done that I'm getting this message. Also, I'm not sure where it's generating the other 9 plots, because thay aren't in the directory that I'm providing not in the one that I'm working in

MrOlm commented 11 months ago

Hi @marina-vn - OK this makes sense. If you want to dig into this on a SNV-by-SNV level I would consider also adding .bam files to your inStrain compare command in order to run the SNV Pooling functions.

1) I think this is a small bug with the inStrain compare module, where it wants to have a scaffold to bin file to make that plot. I think the best workaround would just be to make an .stb file that just has you genome, and pass that to inStrain compare. That should make it make plot 10. Info on that is here: https://instrain.readthedocs.io/en/master/user_manual.html#preparing-a-scaffold-to-bin-file

2) Plots 1-9 are generated within inStrain profile, not inStrain compare.

Best, Matt

marina-vn commented 11 months ago

Hi Matt

I have added the bams files an the scaffold to bin, but I'm getting this error:

Traceback (most recent call last):
  File "/home/emm4/anaconda3/envs/inStrain2/bin/inStrain", line 31, in <module>
    inStrain.controller.Controller().main(args)
  File "/home/emm4/anaconda3/envs/inStrain2/lib/python3.8/site-packages/inStrain/controller.py", line 57, in main
    self.compare_operation(args)
  File "/home/emm4/anaconda3/envs/inStrain2/lib/python3.8/site-packages/inStrain/controller.py", line 89, in compare_operation
    inStrain.compare_controller.CompareController(args).main()
  File "/home/emm4/anaconda3/envs/inStrain2/lib/python3.8/site-packages/inStrain/compare_controller.py", line 46, in main
    self.parse_arguments()
  File "/home/emm4/anaconda3/envs/inStrain2/lib/python3.8/site-packages/inStrain/compare_controller.py", line 150, in parse_arguments
    self.name2bam = gen_name_2_bam(self.inputs, args.bams, self.kwargs.get('processes', 1))
  File "/home/emm4/anaconda3/envs/inStrain2/lib/python3.8/site-packages/inStrain/compare_controller.py", line 679, in gen_name_2_bam
    name2bam[name] = inStrain.profile.samtools_ops.prepare_bam_fie(bam, processes)
  File "/home/emm4/anaconda3/envs/inStrain2/lib/python3.8/site-packages/inStrain/profile/samtools_ops.py", line 61, in prepare_bam_fie
    pysam.AlignmentFile(bam).mapped
  File "pysam/libcalignmentfile.pyx", line 748, in pysam.libcalignmentfile.AlignmentFile.__cinit__
  File "pysam/libcalignmentfile.pyx", line 958, in pysam.libcalignmentfile.AlignmentFile._open
  File "pysam/libchtslib.pyx", line 361, in pysam.libchtslib.HTSFile.check_truncation
OSError: no BGZF EOF marker; file may be truncated

I have installed the program this afternoon using the following command:

conda install -c conda-forge -c bioconda -c defaults instrain

Is there anything else that I should doregarding the installation to correct this error?

MrOlm commented 11 months ago

Hi @marina-vn - that error means that pysam doesn't like something about the .bam files you're providing. If you provide the same .bam files that you provided to inStrain profile it should work (since it's the same pysam code in both inStrain profile and inStrain compare)

Best, Matt

marina-vn commented 11 months ago

Hi @MrOlm ,

I have submitted the bams in the same order than the profiles and the scaffold-to-bin file, and now I'm getting this error


..:: inStrain compare Step 3. Auxiliary processing ::..


Traceback (most recent call last): File "/home/emm4/anaconda3/envs/inStrain2/lib/python3.8/site-packages/pandas/core/indexes/base.py", line 3653, in get_loc return self._engine.get_loc(casted_key) File "pandas/_libs/index.pyx", line 147, in pandas._libs.index.IndexEngine.get_loc File "pandas/_libs/index.pyx", line 176, in pandas._libs.index.IndexEngine.get_loc File "pandas/_libs/hashtable_class_helper.pxi", line 7080, in pandas._libs.hashtable.PyObjectHashTable.get_item File "pandas/_libs/hashtable_class_helper.pxi", line 7088, in pandas._libs.hashtable.PyObjectHashTable.get_item KeyError: 'cryptic'

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/home/emm4/anaconda3/envs/inStrain2/bin/inStrain", line 31, in <module>
    inStrain.controller.Controller().main(args)
  File "/home/emm4/anaconda3/envs/inStrain2/lib/python3.8/site-packages/inStrain/controller.py", line 57, in main
    self.compare_operation(args)
  File "/home/emm4/anaconda3/envs/inStrain2/lib/python3.8/site-packages/inStrain/controller.py", line 89, in compare_operation
    inStrain.compare_controller.CompareController(args).main()
  File "/home/emm4/anaconda3/envs/inStrain2/lib/python3.8/site-packages/inStrain/compare_controller.py", line 73, in main
    self.run_auxillary_processing()
  File "/home/emm4/anaconda3/envs/inStrain2/lib/python3.8/site-packages/inStrain/compare_controller.py", line 297, in run_auxillary_processing
    PM = inStrain.polymorpher.PoolController(self.SC_objects, self.name2bam)
  File "/home/emm4/anaconda3/envs/inStrain2/lib/python3.8/site-packages/inStrain/polymorpher.py", line 65, in __init__
    db = db[db['cryptic'] == False]
  File "/home/emm4/anaconda3/envs/inStrain2/lib/python3.8/site-packages/pandas/core/frame.py", line 3761, in __getitem__
    indexer = self.columns.get_loc(key)
  File "/home/emm4/anaconda3/envs/inStrain2/lib/python3.8/site-packages/pandas/core/indexes/base.py", line 3655, in get_loc
    raise KeyError(key) from err
KeyError: 'cryptic'
MrOlm commented 11 months ago

huh- could you send me the log file please? and can you confirm that each of your inStrain profile objects has at least 1 SNV called in the SNV output tables?

Best, Matt

marina-vn commented 11 months ago

Hi,

I have checked the profiles and removed two that didn't have any SNV, run it again and it generated the output but not the figure:

`*** ..:: inStrain compare Step 1. Load data ::..


Loading Profiles into RAM: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 82/82 [00:01<00:00, 42.79it/s] 1 of 1 scaffolds are in at least 2 samples


..:: inStrain compare Step 2. Run comparisons ::..


Running group 1 of 1 Comparing scaffolds: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1/1 [03:33<00:00, 213.40s/it]


..:: inStrain compare Step 3. Auxiliary processing ::..


Pulling 74482 SNVs from 82 scaffolds. This should take ~ 1.4 min in total Pulling SNVs from BAMs: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 82/82 [00:03<00:00, 23.05it/s] Generating pooled SNV summary table: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:00<00:00, 1.46it/s] Cannot cluster genome 37F6.fasta; 14 of 3321 comaprisons involve no genomic overlap at all: see log for more


..:: inStrain compare Step 4. Store results ::..


making plots 10 Plotting plot 10 Cannot cluster genome 37F6.fasta; 14 of 3321 comaprisons involve no genomic overlap at all: see log for more Done! $$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$

..:: inStrain compare finished ::..

Output tables........ /scr/marina_scr/MG20/bamreads/MG_mapped_only_iSp_COMPARE_bams_noempty/output/ Figures.............. /scr/marina_scr/MG20/bamreads/MG_mapped_only_iSp_COMPARE_bams_noempty/figures/ Logging.............. /scr/marina_scr/MG20/bamreads/MG_mapped_only_iSp_COMPARE_bams_noempty/log/

See documentation for output descriptions - https://instrain.readthedocs.io/en/latest/

$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$`

If I remove the 3 different samples that appear to be a problem (Cannot cluster genome 37F6.fasta; 14 of 3321 comaprisons involve no genomic overlap at all: see log for more) it does generate the dendogram, but I was wondering why those samples suppose a problem when they are correctly mapped and are present in all the output files

Also, is there a way to represent the similarty between samples that isn't the dendogram? With 82 different samples, it's not a very legible way of understanding the results, I was wondering if there's an easy way of generaint a double entry table/heatmap that depicts the similarity between the samples

Best,

noempty_log.log

MrOlm commented 11 months ago

Hi @marina-vn -

Yes- I agree with that many samples a dendrogram can become very unweildy! I would recommend making heatmaps from the genomewide-compare.tsv (https://instrain.readthedocs.io/en/latest/example_output.html#genomewide-compare-tsv) . The link above should make it pretty straightforward for you to do so, but please let me know if you run into trouble.

Best, Matt