how to use smrtsv2 on a aligned bam file

masen1991 commented 3 years ago

hello , how can i use smrtsv2 on a aligned bam? i see https://github.com/EichlerLab/smrtsv2/issues/41 ,and i do this: i try to vim a align/alignments.fofn in dir (which have folder as align/ reference/),and i already try smrtsv2 index reference. in align/alignments.fofn,i only have this line: /abs_path/all_20K_40K_combine_merged_nglmr_b38.sorted.bam and when i try smrtsv detect or smrtsv run it all get this : Detecting variant signatures ValueError in line 31 of ~/bio/smrtsv2/rules/detect.snakefile: invalid literal for int() with base 10: 'all_20K_40K_combine_merged_nglmr_b38.sorted' File "~/bio/smrtsv2/rules/detect.snakefile", line 31, in

can you help me.

Thanks.

paudano commented 3 years ago

The error comes from this line: DETECT_BATCH_LIST.append(int(os.path.basename(line).rstrip('.bam')))

SMRT-SV expects the BAM files to have an integer name (batch name). Create a symlink named 0.bam that points to your BAM file, then place the path to that symlink in "align/alignments.fofn".

I'm offering best-effort support on SMRT-SV right now. I think there are better variant callers now, and I hardly even use it myself, though it may still be useful to get supporting calls for a better callset. Let me know if you have other problems though.

masen1991 commented 3 years ago

Hello @paudano ,thanks for the quick reply. i try what you said,and there is a little wrong ,i need to symlink 0.bam in align/bam/0.bam or it will go wrong. thanks again ,and i am running on detect step now,hope it can go well

masen1991 commented 3 years ago

hello @paudano when i run smrtsv detect,it still go wrong

Error in rule detect_gaps_search: jobid: 26 output: detect/gaps/batch/0_all.bed log: detect/gaps/batch/log/0.log

RuleException: CalledProcessError in line 489 of /picb/humpopg-bigdata5/masen/bio/smrtsv2/rules/detect.snakefile: Command ' set -euo pipefail; samtools view -F 0x4 -q 30 align/bam/0.bam | python2 -s /picb/humpopg-bigdata5/masen/bio/smrtsv2/scripts/PrintGaps.py reference/ref.fasta /dev/stdin --condense 20 >detect/gaps/batch/0_all.bed 2>detect/gaps/batch/log/0.log ' returned non-zero exit status 1. File "/picb/humpopg-bigdata5/masen/bio/smrtsv2/rules/detect.snakefile", line 489, in __rule_detect_gaps_search File "/picb/humpopg-bigdata5/masen/bio/smrtsv2/dep/conda/build/envs/python3/lib/python3.6/concurrent/futures/thread.py", line 55, in run Removing output files of failed job detect_gaps_search since they might be corrupted: detect/gaps/batch/0_all.bed Shutting down, this might take some time. Exiting because a job execution failed. Look above for error message

any idea?

paudano commented 3 years ago

Look it: detect/gaps/batch/log/0.log

The error that caused the rule to fail should appear there.

masen1991 commented 3 years ago

Hello @paudano Traceback (most recent call last): File "/picb/humpopg-bigdata5/masen/bio/smrtsv2/scripts/PrintGaps.py", line 449, in if tPos > fai[chrName][0]: KeyError: 'chr1_KI270706v1_random' do it mean alt in ref maybe cause this bug? maybe is there any scripts can rectify the difference from many kind of GRCh38 ref(with ALT or maybe chr prefix or chrM to chrMT),it all may cause bug and it really take a long time to find this bug.

paudano commented 3 years ago

I think the contig alignments are against a different reference than what you are pointing SMRT-SV to. "chr1_KI270706v1_random" must be in the BAM but not in the reference FASTA or FAI.

My apologies for the delayed response.

LYC-vio commented 2 years ago

Hello, I'm also trying to use smrtsv2 on an aligned bam file (NA24385 PacBio CCS data aligned against hg19 reference using NGMLR) with the method from #41, however, it reports an error in the assemble_polish step. The error was from smrtsv2/dep/conda/build/envs/pacbio/lib/python2.7/site-packages/pbcore/io/align/BamIO.py and according to the error message, it failed because the aligned bam file does not have a suitable header (current failed on the @ RG) .

According to BamIO.py:

def _loadReadGroupInfo(self):
        rgs = self.peer.header["RG"]
        readGroupTable_ = []

        # RGID -> ("abstract feature name" -> actual feature name)
        self._baseFeatureNameMappings = {}
        self._pulseFeatureNameMappings = {}

        for rg in rgs:
            rgID = rgAsInt(rg["ID"])
            rgName = rg["PU"]
            ds = dict([pair.split("=") for pair in rg["DS"].split(";") if pair != ""])
            # spec: we only consider first two components of basecaller version
            # in "chem" lookup
            rgReadType = ds["READTYPE"]
            rgChem = "unknown"
            rgFrameRate = 0.0
            rgSample = rg.get("SM", "UnnamedSample")
            if rgReadType != "TRANSCRIPT":
                if set(("BASECALLERVERSION", "SEQUENCINGKIT", "BINDINGKIT")) <= set(ds):
                    pass
                else:
                    raise IOError("This does not appear to be a valid PacBio BAM file. Only datasets from RS II and Sequel instruments are supported by this program.")
                rgFrameRate = ds["FRAMERATEHZ"]
                basecallerVersion = ".".join(ds["BASECALLERVERSION"].split(".")[0:2])
                triple = ds["BINDINGKIT"], ds["SEQUENCINGKIT"], basecallerVersion
                rgChem = decodeTriple(*triple)

            # Look for the features manifest entries within the DS tag,
            # and build an "indirection layer", i.e. to get from
            # "Ipd"  to "Ipd:Frames"
            # (This is a bit messy.  Can we separate the manifest from
            # the rest of the DS content?)
            baseFeatureNameMapping  = { key.split(":")[0] : key
                                        for key in ds.keys()
                                        if key in BASE_FEATURE_TAGS }
            pulseFeatureNameMapping = { key.split(":")[0] : key
                                        for key in ds.keys()
                                        if key in PULSE_FEATURE_TAGS }
            self._baseFeatureNameMappings[rgID]  = baseFeatureNameMapping
            self._pulseFeatureNameMappings[rgID] = pulseFeatureNameMapping

            readGroupTable_.append(
                (rgID, rgName, rgReadType, rgChem, rgFrameRate, rgSample,
                 frozenset(baseFeatureNameMapping.iterkeys())))

it required many field that wasn't provided in the aligned bam file, and as mentioned in #24, smrtsv2 need the PacBio subread bam files for the assembly polish step. Does that mean smrtsv2 can not run on the bam file which is not in PacBio BAM format? I'm kind of confused because some datasets only provide the fasta/fastq file, and I'm not able to get the RG information from the aligners (e.g. the bam header from minimap2 only have HD, SQ and PG). Is there any way to solve this problem?

Thank you

EichlerLab / smrtsv2

how to use smrtsv2 on a aligned bam file #56