CGATOxford / UMI-tools

Tools for handling Unique Molecular Identifiers in NGS data sets
MIT License

get the Segmentation fault from umi_tools group command line #193

Closed mei2000 closed 5 years ago

mei2000 commented 6 years ago

I am having trouble running the umi_tools group command: it always ends in a segmentation fault. I have just upgraded umi_tools from version 0.4.4 to 0.5.0 with pip. Here is my pip command and its output:

pip install --user --upgrade umi_tools
Collecting umi_tools
Using cached umi_tools-0.5.0.tar.gz
Requirement already up-to-date: setuptools>=1.1 in ./.local/lib/python3.5/site-packages (from umi_tools)
Requirement already up-to-date: numpy>=1.7 in ./.local/lib/python3.5/site-packages (from umi_tools)
Requirement already up-to-date: pandas>=0.12.0 in ./.local/lib/python3.5/site-packages (from umi_tools)
Requirement already up-to-date: future in ./.local/lib/python3.5/site-packages (from umi_tools)
Requirement already up-to-date: regex in ./.local/lib/python3.5/site-packages (from umi_tools)
Requirement already up-to-date: scipy in ./.local/lib/python3.5/site-packages (from umi_tools)
Requirement already up-to-date: matplotlib in ./.local/lib/python3.5/site-packages (from umi_tools)
Requirement already up-to-date: python-dateutil>=2 in ./.local/lib/python3.5/site-packages (from pandas>=0.12.0->umi_tools)
Requirement already up-to-date: pytz>=2011k in ./.local/lib/python3.5/site-packages (from pandas>=0.12.0->umi_tools)
Requirement already up-to-date: cycler>=0.10 in /usr/prog/python/3.5.1-goolf-1.5.14-NX/lib/python3.5/site-packages/cycler-0.10.0-py3.5.egg (from matplotlib->umi_tools)
Requirement already up-to-date: pyparsing!=2.0.4,!=2.1.2,!=2.1.6,>=2.0.1 in ./.local/lib/python3.5/site-packages (from matplotlib->umi_tools)
Requirement already up-to-date: six>=1.10 in ./.local/lib/python3.5/site-packages (from matplotlib->umi_tools)
Installing collected packages: umi-tools
Running setup.py install for umi-tools ... done
Successfully installed umi-tools-0.5.0

Here is the result of my umi_tools group command line:

[1]+ Segmentation fault ~geru1/.local/bin/umi_tools group -I input.bam --paired --group-out=input.tsv -L logfile.txt --output-bam -S output.bam

Please help me to resolve these issues. Thank you,

Mei

IanSudbery commented 6 years ago

Hi Mei,

Thanks for the report, hopefully we will be able to help you. Can you let us know:

1) How much memory is available
2) The first 20 and last 20 lines of the logfile
3) The output of samtools quickcheck input.bam
4) The output of samtools view input.bam | head
5) How big your BAM file is

Cheers,

Ian
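The five items requested above could be gathered in one pass with something like the sketch below. The function name collect_diagnostics and its two arguments are hypothetical; only standard tools plus samtools quickcheck and samtools view are used, and the samtools steps are skipped if samtools is not on the PATH.

```shell
# Hypothetical helper: print the five requested diagnostics for one BAM/log pair.
# Usage: collect_diagnostics input.bam logfile.txt > report.txt
collect_diagnostics() {
    bam="$1"
    log="$2"

    echo "== 1) memory =="
    if command -v free >/dev/null 2>&1; then
        free -h
    else
        echo "free not available on this system"
    fi

    echo "== 2) logfile head/tail =="
    head -n 20 "$log"
    echo "..."
    tail -n 20 "$log"

    if command -v samtools >/dev/null 2>&1; then
        echo "== 3) samtools quickcheck =="
        # quickcheck is silent on success, so report the exit status explicitly
        if samtools quickcheck "$bam"; then
            echo "quickcheck: ok"
        else
            echo "quickcheck: FAILED"
        fi
        echo "== 4) first reads =="
        samtools view "$bam" | head
    else
        echo "samtools not found; skipping items 3 and 4"
    fi

    echo "== 5) BAM size =="
    ls -lh "$bam"
}
```

Because samtools quickcheck prints nothing when the file looks intact, the sketch echoes an explicit "ok" so that an empty result is not mistaken for a missing answer.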

mei2000 commented 6 years ago

Hi Ian,

Here are my answers to your questions:

1) I ran this BAM file on our big server, which has 1GB RAM and 96 cores, and I am the only one using the machine right now.

2) first 20 lines of logfile:

output generated by group -I /db/dmp/rge/Crispr/on-target_doc/project_2017/expID_2941_QiaSeq/benchMark_smCounterTool/bam_clean/ZF-91-OH77.bam --paired --group-out=ZF-91-OH77.tsv -L group_nodedup_log.txt --output-bam -S ZF-91-OH77.bam

job started at Wed Oct 11 17:21:16 2017 on nrusca-sld0136 -- 3aa708f4-314f-4a46-aed3-6dad44e1df21

pid: 120635, system: Linux 2.6.32-696.6.3.el6.x86_64 #1 SMP Fri Jun 30 13:24:18 EDT 2017 x86_64

cell_tag : None

chrom : None

compresslevel : 6

detection_method : None

gene_tag : None

gene_transcript_map : None

get_umi_method : read_id

ignore_umi : False

in_sam : False

log2stderr : False

loglevel : 1

mapping_quality : 0

method : directional

no_sort_output : False

out_sam : False

output_bam : True

output_unmapped : False

2) last 20 lines of logfile:

[geru1@rogue scripts]$ ^C
[geru1@rogue scripts]$ tail -20 group_nodedup_log.txt
2017-10-11 17:24:50,860 INFO Written out 2780000 reads
2017-10-11 17:24:50,861 INFO Written out 2790000 reads
2017-10-11 17:24:50,861 INFO Written out 2800000 reads
2017-10-11 17:24:50,861 INFO Written out 2810000 reads
2017-10-11 17:24:50,861 INFO Written out 2820000 reads
2017-10-11 17:24:51,042 INFO Written out 2830000 reads
2017-10-11 17:24:52,048 INFO Written out 2840000 reads
2017-10-11 17:24:53,025 INFO Written out 2850000 reads
2017-10-11 17:24:54,201 INFO Written out 2860000 reads
2017-10-11 17:24:55,400 INFO Written out 2870000 reads
2017-10-11 17:24:56,606 INFO Written out 2880000 reads
2017-10-11 17:24:57,704 INFO Written out 2890000 reads
2017-10-11 17:24:58,636 INFO Written out 2900000 reads
2017-10-11 17:24:59,605 INFO Written out 2910000 reads
2017-10-11 17:25:00,558 INFO Written out 2920000 reads
2017-10-11 17:25:01,529 INFO Written out 2930000 reads
2017-10-11 17:25:02,729 INFO Written out 2940000 reads
2017-10-11 17:25:03,302 INFO Written out 2950000 reads
2017-10-11 17:25:04,129 INFO Written out 2960000 reads
2017-10-11 17:25:04,696 INFO Written out 2970000 reads

3) samtools quickcheck input.bam (nothing is returned here)

4) samtools view input.bam | head

M00145:250:000000000-AV6RL:1:2109:19889:20270_AGCGATGGCCGG 129 chr1 21198 0 127M chr19 32891088 0 CGGTGCTCCCCACTCCACTGCCAGTCATCACTGGCTCTCCCTTCCCTTCATCCTCGTTCCCTATCTGTCACCATTTCCTGTCGTCGTTTCCTCTGAATGTCTCACCCTGCCCTCCCTGCTTACAAGT HGGGGGHHHGGHGHHHHHHHHHGHHHHHHHHHHHHHHHHHHHGHHHHHHHHHHHHGHGHHHHHHHHHHHHHHEHHFHGHHHHGFGGGGGHGHHGHGEHHHHHHHHHHHGHHGHHGGHGEHHHHHHHF NM:i:2 AS:i:117 XS:i:117 RG:Z:1
M00145:250:000000000-AV6RL:1:1105:24775:5278_TCTCACGGCTTC 129 chr1 24400 0 126M chr15 101966263 0 CTGCCTTGCGCACGAGCACTGCTGGGTAAATATTTGTTGGCTGCAGGAAAACGTGAAGGAATAGGCCCTCCAATGGGAGGAAAAGCATGAGTTGGGAGAGCAGAGCCACCACAGGAAACCAGGAGG HH44BFFGBE2EEGGGGGHHHGGEHGCGHGFHGHHHHHAGHHGGBG1?GHBG0FAHGHHHF3GHHHGFHHGFF4FB00?EEGF?FCEBEBFGBG3</?/BFF0CAG00GE/GEHAGHBG</F?/F? NM:i:2 AS:i:116 XS:i:116 RG:Z:1
M00145:250:000000000-AV6RL:1:2109:3969:19537_TCTCACGGCTTC 129 chr1 24400 0 126M chr15 101966263 0 CTGCCTTGCACACGAGCACTGCTGGGTAAATATTTGTTGGCTGCAGGAAAACGTGAAGGAATAGGCCCTCCAATGGGAGGAAAAGCATGAGTTGTGAGAGCAGAGCCACCACAGGAAACCAGGAGG HHHHHHHHFHHGHGGGGGHHHHHHHGGFGGGHHHHHFHHGFGGAGHHGHFHHGGHHGHGGHFHHHGHGHFGFHE@3FFFCFHGGHHHFHGHHHHHDFH?GHHHHHGHGHHHFHGHCGHHHGHGECF NM:i:0 AS:i:126 XS:i:126 RG:Z:1
M00145:250:000000000-AV6RL:1:1101:18438:10582_GGCAGCACAGTG 65 chr1 24606 0 73M chr15 101966283 0 CACACAGGGAAGCCAGATGGGTTCCCCAGGACCGGGATTCCCCAAGGGGGCTGCTCCCAGAGGGTGTGTTGCT GGGHHGHHGGHGHHHHHHHHHHGHHHHHGGHHHGGGGGHHHHHHGHGGGGGGGGHHHGHHHHGGGGFGHHHHH NM:i:0 AS:i:73 XS:i:73 RG:Z:1
M00145:250:000000000-AV6RL:1:2108:6457:14801_GGCAGCACAGTG 65 chr1 24606 0 73M chr15 101966283 0 CACACAGGGGAGCCAGATGGGTTCCCCAGGACCGGGATTCCCCAAGGGGGCTGCTCCCAGAGGGTGTGTTGCT GAEGHEHFGG?GEGHHHHHFHFGHHHHHGFEEHGGGCECFHEHHGHGGGGDGGGHGGEFHGEGG?FCEFGHFH NM:i:1 AS:i:68 XS:i:68 RG:Z:1
M00145:250:000000000-AV6RL:1:2109:11990:25142_GGCAGCACAGTG 65 chr1 24606 0 73M chr15 101966283 0 CACACAGGGAAGCCAGATGGGTTCCCCAGGACCGGGATTCCCCAAGGGGGCTGCTCCCAGAGGGTGTGTTGCT GGGHHGHHGGHGHHHHHHHHHHGHHHHHGGHHHGGGGGHHHHHHGHGGGGGGGGHHHGHHHHGGCFDDGHHGH NM:i:0 AS:i:73 XS:i:73 RG:Z:1
M00145:250:000000000-AV6RL:1:2106:25375:14190_TAGTAACACCGC 129 chr1 26335 0 128M chr15 101964499 0 GCAAGTTTGCTGGATGTCCTAACTTATTTCTGTGCCTCAGTTCTCCCATATGTAAGATCACAAAGGGGGTAAAGATGCAAGATATTTCCTGTGCACATCTTCAGATGAATTCCTTGTTAGTGTGTGTT HHHGHHHHGHHHHHHHHHHHHHHHHGHHHHHHHHIHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHFGGGGHHHHHHHHHHHHHHGHHHHHHHHHHHHHHHHHHHHHHGHGGHFHHHHHHHGHHHHHH NM:i:3 AS:i:117 XS:i:117 RG:Z:1
M00145:250:000000000-AV6RL:1:2107:14840:17062_TAGTAACACCGC 129 chr1 26335 0 128M chr15 101964499 0 GCAAGTTTGCTGGATGTCCTAACTTATTTCTGTGCCTCAGTTCTCCCATATGTAAGATCACAAAGGGGGTAAAGATGCAAGATATTTCCTGTGCACATCTTCAGATGAATTCCTTGTTAGTGTGTGTT HHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHFHHHHHHHHGHGHHHHHHHHHHHHHHFHGGCEEHGHGHHHHHHHHHHHHHHHHGHHHHGHHHHHHHHHHHHHHHHHHHHGGHDGHFHHHH NM:i:3 AS:i:117 XS:i:117 RG:Z:1
M00145:250:000000000-AV6RL:1:2113:21381:12006_TAGTAACACCGC 129 chr1 26335 0 128M chr15 101964499 0 GCAAGTTTGCTGGATGTCCTAACTTATTTCTGTGCCTCAGTTCTCCCATATGTAAGATCACAAAGGGGGTAAAGATGCAAGATATTTCCTGTGCACATCTTCAGATGAATTCCTTGTTAGTGTGTGTT HHCFHHHHHHHHHHHHHHHHHHHHHHHHHHHHHGHHHHHGHHHHHHFHHGHHHHHHFHHHHHFGHEGGGEGFHGHHFHHHHGHHHHHHGHHHHHGHHGHFHHHGHHHHHHHHHHHHHHHHHHHHHHHH NM:i:3 AS:i:117 XS:i:117 RG:Z:1
M00145:250:000000000-AV6RL:1:2114:9090:27249_TAGTAACACCGC 129 chr1 26335 0 128M chr15 101964499 0 GCAAGTTTGCTGGATGTCCTAACTTATTTCTGTGCCTCAGTTCTCCCATATGTAAGATCACAAAGGGGGTAAAGATGCAAGATATTTCCTGTGCACATCTTCAGATGAATTCCTTGTTAGTGTGTGTT HGCHHHGHFHGGGAGGFFDGGGFFHGEHHEFFGHGDFFGHHGFHH53FGFBEDEEGGBGFFDFFFHGGCEGHHHHHHGFHHGFFH4GFFHHFGBGHHF3FGHFGFGFGFHFDH4F3334B4??FGBBC NM:i:3 AS:i:117 XS:i:117 RG:Z:1

5) input.bam size=167M

Thanks

Mei

IanSudbery commented 6 years ago

I'm assuming 1GB is a typo and you mean 1TB?

This is not something we have seen before, but a segmentation fault usually means either that you are running out of memory (unlikely if you have 1TB, and it failed that quickly) or that there is a problem in C code, which would probably point to one of our dependencies!

For the dependencies, it could be a conflict or a bad version in your installation (specific to your setup) or a problem in the interaction between your input and the C code (specific to your input).

Can you check whether you can run the test data in the QUICK_START guide? That should hopefully tell us whether it's a general problem or specific to the input.

We should probably rule out memory by prefixing the command with time -v and reporting the max memory usage.
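One practical note on the suggestion above: in bash, the built-in time keyword does not accept -v, so the verbose report needs GNU time, usually installed as /usr/bin/time. A minimal sketch of pulling the peak memory figure out of that report follows; the function name max_rss_kb and the file name time_report.txt are hypothetical.

```shell
# Print the peak resident memory (in kB) recorded in a GNU "time -v" report.
max_rss_kb() {
    awk -F': ' '/Maximum resident set size/ { print $2 }' "$1"
}

# Usage sketch (not executed here); -o writes the report to a file:
#   /usr/bin/time -v -o time_report.txt \
#       umi_tools group -I input.bam --paired --group-out=input.tsv \
#       -L logfile.txt --output-bam -S output.bam
#   max_rss_kb time_report.txt
```

GNU time still writes its report when the timed command dies from a signal, so the figure should be available even for the run that segfaults.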

mei2000 commented 6 years ago

Sorry, that was a typo; the server has 1TB of RAM.

TomSmithCGAT commented 6 years ago

Hi @mei2000 - The QUICK_START guide can be found here. For the testing purposes here, you can skip straight to step 5 and run the following commands

wget https://github.com/CGATOxford/UMI-tools/releases/download/v0.2.3/example.bam
umi_tools dedup -I example.bam --output-stats=deduplicated -S deduplicated.bam
IanSudbery commented 6 years ago

The BAM will need indexing first. I.e.

wget https://github.com/CGATOxford/UMI-tools/releases/download/v0.2.3/example.bam
samtools index example.bam
umi_tools dedup -I example.bam --output-stats=deduplicated -S deduplicated.bam
TomSmithCGAT commented 6 years ago

Good point. Thanks!

mei2000 commented 6 years ago

Hi Ian,

I just ran the test BAM file downloaded from your GitHub site, and I don’t see any error message from the command line. Your BAM file is about 20MB and mine is about 70MB.

Here is the command line: umi_tools group -I example.bam --paired --group-out=groups.tsv -L group_log.txt --output-bam -S mapped_grouped.bam

Thanks

Robin

TomSmithCGAT commented 6 years ago

Hi @mei2000 - The good news is the installation appears to have worked OK if you can run the test bam file. The bad news is this means there's something unexpected in your BAM.

In order to work out exactly what the problem is, the best approach is to reduce the BAM to a more manageable size that still reproduces the error. From the final line in the logfile, it looks like the issue occurs between reads 2970000 and 2980000. Your failed command should have left a partial output BAM called output.bam. You can create a new minimal BAM that starts from the last read in output.bam and contains the next 10000 reads: samtools view -b input.bam [region] > minimal.bam, where region takes the form "contig:start-end", e.g. "chr1:1-10000" (input.bam needs to be indexed for region queries). Hopefully, you will get the same error with this minimal BAM. By repeating this process, you may even be able to narrow the error down to a single read or read pair. If you're OK to share your data, you can also email me this minimal BAM and I can try to hunt down the issue (tss38@cam.ac.uk)
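The narrowing-down procedure can be sketched roughly as follows. The two helper functions are hypothetical, and the contig, start position and span in the usage comments are placeholders to be replaced with the values read from the partial output. The samtools calls appear only in comments; they assume the usual argument order (input file first, region last) and -b for BAM output.

```shell
# Print the highest "Written out N reads" count found in a umi_tools log.
last_written_reads() {
    sed -n 's/.*Written out \([0-9][0-9]*\) reads.*/\1/p' "$1" | tail -n 1
}

# Build a samtools region string "contig:start-end" covering SPAN bases from START.
region_from() {
    printf '%s:%d-%d\n' "$1" "$2" "$(( $2 + $3 ))"
}

# Usage sketch (not executed here; chr1, 26335 and 10000 are placeholders):
#   last_written_reads logfile.txt            # how far the run got
#   samtools view output.bam | tail -n 1 | cut -f 3,4   # contig and position of last read written
#   samtools index input.bam                  # region queries need an index
#   samtools view -b input.bam "$(region_from chr1 26335 10000)" > minimal.bam
#   umi_tools group -I minimal.bam --paired --group-out=minimal.tsv -S minimal_out.bam
```

Note that the region is measured in bases rather than reads, so the span may need adjusting until the slice holds roughly the intended number of reads. Halving the region on each round that still crashes is one way to converge on the offending read.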

IanSudbery commented 6 years ago

Hi @mei2000 : did you say that your BAM was 167M reads, but only 70MB on disk? I feel like 167M reads should take more than 70MB of disk space.
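A quick back-of-the-envelope check of those figures, as a sketch: the helper name bytes_per_read is hypothetical, and the sizes are taken at face value from the thread (70MB treated as 70000000 bytes). The second call uses the 2970000 reads reported in the log before the crash.

```shell
# Average on-disk bytes per read for a file of BYTES bytes holding READS reads.
bytes_per_read() {
    awk -v b="$1" -v r="$2" 'BEGIN { printf "%.2f\n", b / r }'
}

bytes_per_read 70000000 167000000   # 167M reads in 70MB: 0.42 bytes per read
bytes_per_read 70000000 2970000     # ~2.97M reads in 70MB: 23.57 bytes per read

# The actual read count can be obtained with: samtools view -c input.bam
```

Less than half a byte per read is not plausible for a BAM, which supports the suspicion that 167M was a file size rather than a read count.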