cancerit / BRASS

Breakpoints via assembly - Identifies breaks and attempts to assemble rearrangements in whole genome sequencing data.
GNU Affero General Public License v3.0
57 stars 20 forks source link

Error in get_abs_bkpts_from_clipped_reads.pl #95

Closed udvzol closed 4 years ago

udvzol commented 4 years ago

Hi,

I am getting the following error in Implement_filter.abs_bkp.err file.

I know the immediate reason, which is that if in the bam file the RNEXT filed is "*" then the mate_seq_id of that read is not set. My problem is, that I don't know why it happens, and how to resolve this issue. If I naively ignore these reads, it seems to produce messed-up results.

Could you please help me with this problem? Thanks!

+ /usr/bin/perl get_abs_bkpts_from_clipped_reads.pl -fasta refgenome/GRCh38.d1.vd1.fa -out output2/S_16_1035_S47603_vs_S_16_1034_S47625.r4 merged_bam/1035.bam output2/S_16_1035_S47603_vs_S_16_1034_S47625.r3
Collecting supporting read pairs for each region...
Use of uninitialized value in string eq at get_abs_bkpts_from_clipped_reads.pl line 625.
Command exited with non-zero status 2
0.26user 0.04system 0:00.35elapsed 86%CPU (0avgtext+0avgdata 40808maxresident)k
0inputs+8outputs (0major+9361minor)pagefaults 0swaps
udvzol commented 4 years ago

@keiranmraine Could you please look into this? I hope, I am just missing something trivial and you could help quickly. Many thanks!

keiranmraine commented 4 years ago

What did you use to map the data? The area of the code you are having issues with has only been extensively used by bwa-mem mapped data.

udvzol commented 4 years ago

I used the original bwa mem unfortunately not the bwa-mem.pl script you wrote.

keiranmraine commented 4 years ago

I think this is a data edge case issue. The only way to really diagnose this is to add a set of print statements to output the values of all the variables used in this block:

https://github.com/cancerit/BRASS/blob/6496ff0cc58eccab89bf0ca1570f2baed8c1c6a8/perl/bin/get_abs_bkpts_from_clipped_reads.pl#L604-L624

All the values passed into the function before those used within the grep. It is certainly going to be one of the $_->mate_* or $_->mstrand calls being applied on a read whose mate is not mapped.

Unfortunately the person who wrote this code has long since left the group.

Are you able to produce a minimal test set that you can share that triggers this? It can be just data for input to this specific program. .r3 + minimum .bam.

udvzol commented 4 years ago

Thank you for looking into this.

I made a minimal input to reproduce the error. The 96th line of the bam file is the first one producing this error. I attach the files here: minimum.zip

The problem is: $_->mate_seq_id is missing, as RNEXT in the bam file is *.

I assume that these reads are meaningful for SV detection since the reads not mapping in the proper distance are the ones contributing to SVs. Am I correct? If I remove them I lose all the events at the end.

keiranmraine commented 4 years ago

RNEXT = * indicates the mate read is not mapped. I don't think you'd want to remove them all as the same function is used to find singleton reads. It could be that there are no good events but I will look to see if I can at least make this progress without error on your data and confirm it works as expected under our internal test data.

udvzol commented 4 years ago

Thank you for the fast steps!

I tried to rerun my analysis with the new version. Now that error has disappeared but a lot of other warnings came up.

In Implement_filter.abs_bkp.err for all records of r3 I got these warnings:

In Implement_filter.remap_micro.err for all records I got these warnings:

I assume it is because with your modification all reads that were not mapped properly are ignored. Is this the expected behavior?

If I manually look at read names from the groups/r2/r3/r4 files and search the regions in the brm.bam I indeed don't find the exact same read names. They overlap but here are also different names.

And I also noticed that there is a similar collect_reads_by_region function in filter_with_microbes_and_remapping.pl which doesn't give error on undefined RNEXT because it filters by read name.

I really tried to understand where the problem is but for me, it seems that the read names in groups.filtered.bedpe_nohead are incorrect. Have you any idea what is wrong, or is it normal that I get these warnings for all of my records?

Thanks again!

keiranmraine commented 4 years ago

Do you see this in all samples or have you only run one? The filtering code is not perfect and it is quite frequent to see no reads being found as well as no output at the very end.

Poor quality or low coverage area a very common cause of no data.

We have processed thousands of samples generated here and as part of PanCancer and PanProstate global projects with this flow.