bcgsc / straglr

Tandem repeat expansion detection or genotyping from long-read alignments
Other
50 stars 9 forks source link

Error processing plasmid reads #1

Closed sabiqali closed 2 years ago

sabiqali commented 2 years ago

Hi, I am working on counting repeats in ONT plasmid data that we have. Having recently read your paper, we decided to use this method but the yield and the result is not what we are expecting. This is a truth set, so we roughly know how much the count should be.

One of the two major problems we are facing is that it throws the following error consistently.

problem getting seq1 b9435da3-21e3-4de8-aee4-393700401474 ('pBluescript', 762, 821, 'CAG') 682 None 359
problem getting seq1 90bf2650-66ed-4e94-8fb3-5c777860bbc2 ('pBluescript', 762, 821, 'CAG') None 901 None
problem getting seq1 8d4f73e1-508c-4d70-93d9-c010b03a3524 ('pBluescript', 762, 821, 'CAG') 682 None 270

We thought this might be because of a problem related to the BAM file, but we checked that and regenerated that to make sure that the BAM file makes sense,

Further, we thought that maybe providing a FASTA file might aid in retrieving the read and getting a count. So, we provided a FASTA file using the following option.

--reads_fasta

But that led us to another error, this time a TypeError.

The key issue is that the first method mentioned, using the BAM file, yields only ~1200 reads that are processed whereas there are ~200,000 reads that are present.

We would appreciate some help in understanding how to tackle this issue.

Thank you!

P.S. 'pBluescript' is the name of the reference, it is a custom reference built for the purpose of this experiment.

wdecoster commented 2 years ago

Hi,

I am seeing similar messages:

problem getting seq1 6f5799d3-0dab-4781-9c8a-1b2acfdd328c ('chr1', 53129178, 53129209, 'AAAAAAGAAAAA') None 53129289 None
problem getting seq1 db759316-772f-43a8-96b4-5c65e9b5c7bf ('chr1', 24834795, 24834996, 'GAGAGGGAGAGGGAGACGG') None 24835076 None
problem getting seq1 38452027-c315-4953-8cbf-749865e335c9 ('chr1', 53129178, 53129209, 'AAAAAAGAAAAA') None 53129289 None
problem getting seq1 b323d29c-5ee0-4332-836d-a58ce4781ed2 ('chr1', 41888946, 41889227, 'ACCACACAC') 41888866 None 6841
problem getting seq1 4d348e11-3553-4a7e-8fdb-a9bc2657dada ('chr1', 24834795, 24834996, 'GAGAGGGAGAGGGAGACGG') None 24835076 None
problem getting seq1 56792abd-7621-4a37-ae6b-5f656c1b379c ('chr1', 53129178, 53129209, 'AAAAAAGAAAAA') None 53129289 None
problem getting seq1 0fc174c2-7f05-48a2-bda5-ffa7d91860f9 ('chr1', 24834795, 24834996, 'GAGAGGGAGAGGGAGACGG') None 24835076 None
problem getting seq1 34284a37-289c-43f0-a8d3-2735d130fa76 ('chr1', 24834795, 24834996, 'GAGAGGGAGAGGGAGACGG') None 24835078 None
problem getting seq1 12505b6d-8161-4067-90a4-e453ba679197 ('chr1', 24834795, 24834996, 'GAGAGGGAGAGGGAGACGG') None 24835076 None
problem getting seq1 be27c87a-371d-4401-825d-343b1a7a9a8b ('chr1', 53129178, 53129209, 'AAAAAAGAAAAA') None 53129289 None
problem getting seq1 ad03b79d-8174-4724-aa45-ed4b3ca86556 ('chr1', 53129178, 53129209, 'AAAAAAGAAAAA') None 53129289 None
problem getting seq1 87a75701-b485-4007-bdcb-6a1f5996a838 ('chr1', 24834795, 24834996, 'GAGAGGGAGAGGGAGACGG') None 24835076 None
problem getting seq1 9f724076-26fc-4d00-a75c-baff366421e5 ('chr1', 53129178, 53129209, 'AAAAAAGAAAAA') None 53129289 None
problem getting seq1 13d51f19-2d23-4dd1-85b9-aee5c7708494 ('chr1', 24834795, 24834996, 'GAGAGGGAGAGGGAGACGG') None 24835076 None
problem getting seq1 fe919ef7-bd90-46b1-a206-2827b94012ea ('chr1', 24834795, 24834996, 'GAGAGGGAGAGGGAGACGG') None 24835076 None
problem getting seq1 5ad2b905-04f1-4a82-9a56-be349d96e2ee ('chr1', 53129178, 53129209, 'AAAAAAGAAAAA') None 53129289 None
problem getting seq1 317e66df-6da4-4972-9ea4-82c4da030f09 ('chr1', 53129178, 53129209, 'AAAAAAGAAAAA') None 53129289 None
problem getting seq1 5ac46194-b3e0-4fbc-969a-de772267c3d6 ('chr1', 24834795, 24834996, 'GAGAGGGAGAGGGAGACGG') None 24835076 None

Would these perhaps be from supplementary alignments for which I believe aln1.query_sequence is None?

readmanchiu commented 2 years ago

Hi guys,

Wonder if you guys used minimap2 for your alignments? I would recommend using minimap2 -Y to do the alignment, -Y will force it to use soft-clipping for supplementary alignments and hence the read sequences can be readily extracted from the BAM file. That would be the easiest solution, unless you have reasons to not use minimap2.

Yes, the error messages are because the aligners did not output the read sequences for the supplementary alignments. However these messages are just warnings because the primary alignment SAM record should have the read sequence and the script should be able to store and use that for the read sequence.

So @sabiqali, I guess you expect most of your reads contain the repeat expansion? Or could it be most of your reads contain the non-expanded allele and only a small fraction contain the expansion? Like I said, I would suggest re-run the alignment with minimap2 -Y to check.

The --reads_fasta option was also implemented for cases the aligners use hard clipping so that entire read sequences were not retrievable from the BAM file. The FASTA supplied needs to be indexed (with samtools faidx). @sabiqali, is the FASTA indexed?

Anyways, I'll double-check on my side to see if --reads_fasta is working properly and that these error messages are just warnings.

Thanks for checking out the software and raising the issues.

sabiqali commented 2 years ago

Hi,

We are using minimap2 for our alignments, and we did experiment with the `-Y as we thought the soft-clipping might solve the issue. But, it was to no avail and it still gives out the error messages.

Also, regarding the data itself, while the expanded allele might not be present in all the reads, the number of reads that were counted by straglr was almost an order of magnitude less that than the number of reads that I was expecting to be counted.

Also, yes, if you could recheck the --reads_fasta option and the error messages, that would be great!

Thanks!

readmanchiu commented 2 years ago

Hi @sabiqali

Wonder what's causing the misses...W

Would you mind sending me a couple of missed reads so I can test it out? if you want to send me the pBluescript reference too that would help. I think the will be small enough you can send to my e-mail (rchiu@bcgsc.ca)

I wonder if there could be some idiosyncrasies with the short reference that Straglr is not equipped to handle. Just a thought, may be if the expansion is substantially longer than the reference that not all split alignments were produced by Minimap2.

readmanchiu commented 2 years ago

@wdecoster, are you getting expected results? do some of the reads in the error messages also show up in the output carrying the expansion?

sabiqali commented 2 years ago

Hey @readmanchiu

I do not think I will be able to share this particular dataset, but we shall try and replicate the same error in an open source dataset and hopefully that will help.

Thank!

wdecoster commented 2 years ago

Wonder if you guys used minimap2 for your alignments? I would recommend using minimap2 -Y to do the alignment, -Y will force it to use soft-clipping for supplementary alignments and hence the read sequences can be readily extracted from the BAM file. That would be the easiest solution, unless you have reasons to not use minimap2.

Yes, we use minimap2. But the entire cohort has already been aligned without -Y. I am not going to rerun all of these. If you say it is just a warning and the results should be okay then I am not going to bother. I would suggest adding this to the README.

@wdecoster, are you getting expected results? do some of the reads in the error messages also show up in the output carrying the expansion?

It hasn't got that far yet. It just terminated after printing trf input <filename.dat>. I noticed in the code that you send stderr and stdout to /dev/null and also don't use the returncode. So that seems like a source of errors. I'll see if I have time this week to test this further.

readmanchiu commented 2 years ago

sorry @wdecoster, missing read sequences in the bam do actually lead to missed read. I would suggest re-running with the read sequences supplied using --reads_fasta. The reads can be in FASTA or FASTQ format, need to be compressed with bgzip and indexed with samtools faidx.

@sabiqali, I tested the --reads_fasta parameter, discovered a little bug and updated the GitHub repo with the fix. Not sure if it's the same error you saw, but It should be working now.

Please update your code to use the --reads_fasta option

wdecoster commented 2 years ago

Our fastq reads are in multiple fastq files, so that doesn't seem to be an option.

readmanchiu commented 2 years ago

Unfortunately the script needs to access the read sequences to validate the repeat sequences, either through the BAM or FASTQ. You can decompress, concatenate, compress, and index the multple fastq's into a single file. I understand It will take some time to generate this, or it may just be easier to do the re-alignment with minimap2 with the -Y option.

wdecoster commented 2 years ago

So if I understood correctly this problem is specifically for supplementary alignments, of which the sequence is already present in the BAM file as the primary alignment, just in another location?

readmanchiu commented 2 years ago

The issue is Straglr needs to access the read sequence to validate the candidate repeat sequence when it iterates through each alignment record in the BAM file. Very often when the repeat is long, it will give rise to split/clipped/supplementary alignments. If the aligner is only soft-clipping, like telling minimap2 to run with -Y, the entire read sequence will be kept within the alignment record. If the aligner is hard-clipping, only the aligned sequence is kept with the alignment record. In that case, providing the indexed fastq will still allow Straglr to access the intact read sequence.