PacificBiosciences / kineticsTools

Tools for detecting DNA modifications from single molecule, real-time sequencing data
19 stars 21 forks source link

How critical is the "Non-unique read group integer IDs found" warning in running ipdSummary? #98

Open Marjan-Hosseini opened 8 months ago

Marjan-Hosseini commented 8 months ago

I am trying to get the ipdRatio and modified bases information from a PacBio bam file which is sorted and indexed. The command I'm running is:

ipdSummary sorted_chr21_0_100000.bam --reference $REF_path/$REF_FILE_NAME.fasta --referenceWindow 20:0-100000 --identify m6A,m4C,m5C_TET --methylFraction --outfile sorted_chr21_0_100000;

I get the following warning message:

RuntimeWarning: Non-unique read group integer IDs found - some features may be restricted warnings.warn("Non-unique read group integer IDs found - some features may be restricted", RuntimeWarning)

My question is which types of features might be restricted if my read groups are not unique?

I am not sure how critical it could be.

natechols commented 8 months ago

It's probably not critical - it was part of a fix for working with multiplexed data, and I was concerned about possible corner cases so I made it output an obnoxious warning. However, are you running a multi-sample BAM? It's unusual to see that warning unless you're combining barcodes.

Marjan-Hosseini commented 8 months ago

@natechols Thank you for your response! This is the reason my read groups messed up: I once split my large bam file, and because the size of the smaller bams were not small enough I had to merge the bam file again to split. This process for some reason made read groups be in the format of:

@RG     ID:5b57ecf9-62574588    PL:PACBIO       DS:READTYPE=SUBREAD;Ipd:CodecV1=ip;PulseWidth:CodecV1=pw;BINDINGKIT=101-789-500;SEQUENCINGKIT=101-789-300;BASECALLERVERSION=5.0.0;FRAMERATEHZ=100.000000
@RG     ID:f82bacd0-7B8A4FFF    PL:PACBIO       DS:READTYPE=SUBREAD;Ipd:CodecV1=ip;PulseWidth:CodecV1=pw;BINDINGKIT=101-789-500;SEQUENCINGKIT=101-789-300;BASECALLERVERSION=5.0.0;FRAMERATEHZ=100.000000
@RG     ID:9af21d85-1080A37D    PL:PACBIO       DS:READTYPE=SUBREAD;Ipd:CodecV1=ip;PulseWidth:CodecV1=pw;BINDINGKIT=101-789-500;SEQUENCINGKIT=101-789-300;BASECALLERVERSION=5.0.0;FRAMERATEHZ=100.000000

This is part of the read groups. As a result, ipdSummary gave me this error:

ValueError: invalid literal for int() with base 16: '5b57ecf9-1A103DBB'

which was from this part of the code:

pbcore/io/align/_BamSupport.py", line 70, in rgAsInt
    return np.int32(int(rgIdString.split("/")[0], 16))

Because my read groups that now have dashes cannot be converted to int. So in my local installation, I modified line 70 to:

return np.int32(int(rgIdString.split("-")[0], 16))

and that solved the problem but I got the warning because now my read groups are not unique. I don't know why merging bam files by 'samtools' made such read groups. The first splitted files do not have such a format of read groups.

natechols commented 8 months ago

Do not use samtools to merge PacBio BAMs if you want the output to work in downstream PacBio applications. Instead, use the pbmerge utility that is included with SMRT Link, and also distributed here: https://github.com/PacificBiosciences/pbtk (I think it's buried in the install tree in smrtlink install, I can dig up the exact path later). This will ensure that your output is compatible with our reader functions.