detect from bam file ? - Githubissues

crimBubble / ECCsplorer

The ECCsplorer is a bioinformatics pipeline for the automated detection of extrachromosomal circular DNA (eccDNA) from paired-end read data of amplified circular DNA.

GNU General Public License v3.0

detect from bam file ? #2

Closed njaupan closed 2 years ago

njaupan commented 3 years ago

Hi,

I am trying to benchmark your tool against circle-map, circle_finder, etc. on a dataset from a large plant genome (17G).

However, in the initial step of indexing the reference genome, segemehl produced a large index file on disk: ~200G after 12 hours. It later broke on a cluster with 96 CPUs and 496 GB of memory, so the aligner did not work well for a large genome. I noticed that ECCsplorer uses haarz to detect split reads from the BED file produced by segemehl. I tested BAM files produced by other aligners such as BWA, but the pipeline could not run because they do not produce the split-read BED file.

Do you think this can be resolved so that ECCsplorer works with BAM files from other aligners?

Best, panpan

crimBubble commented 3 years ago

Hi panpan,

thanks for your feedback.

Currently, the ECCsplorer pipeline only works with segemehl output. We found that split-read (SR) detection using segemehl/haarz is more accurate than using other aligners, e.g. bwa-mem combined with circle-map's SR detection. Unfortunately, segemehl is very RAM-intensive, especially on large genomes. There are currently no plans to introduce a new stand-alone SR detection algorithm to the ECCsplorer pipeline. For now, the only option for analyzing large genomes is to run them section by section.
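
As an illustration of the section-by-section idea, here is a minimal Python sketch that groups the sequences of a multi-FASTA reference into smaller files, so each file could be indexed and aligned separately. It is not part of the ECCsplorer pipeline. It assumes that "sections" means whole chromosomes or contigs grouped up to a size limit, and the function names, the `max_bases` parameter, and the `.partN.fa` output naming are all hypothetical.

```python
from typing import Iterable, Iterator, List, Tuple

Record = Tuple[str, str]  # (header without '>', sequence)


def read_fasta(lines: Iterable[str]) -> Iterator[Record]:
    """Yield (header, sequence) pairs from FASTA-formatted lines."""
    header = None
    chunks: List[str] = []
    for raw in lines:
        line = raw.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header = line[1:]
            chunks = []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def batch_records(records: Iterable[Record], max_bases: int) -> Iterator[List[Record]]:
    """Greedily group whole sequences into batches of at most max_bases.

    A single sequence longer than max_bases still gets its own batch;
    sequences are never cut, which is an assumption of this sketch.
    """
    batch: List[Record] = []
    size = 0
    for header, seq in records:
        if batch and size + len(seq) > max_bases:
            yield batch
            batch, size = [], 0
        batch.append((header, seq))
        size += len(seq)
    if batch:
        yield batch


def write_batches(fasta_path: str, out_prefix: str, max_bases: int, width: int = 60) -> List[str]:
    """Write each batch to '<out_prefix>.partN.fa' and return the paths."""
    paths: List[str] = []
    with open(fasta_path) as handle:
        batches = batch_records(read_fasta(handle), max_bases)
        for i, batch in enumerate(batches, start=1):
            path = f"{out_prefix}.part{i}.fa"
            with open(path, "w") as out:
                for header, seq in batch:
                    out.write(f">{header}\n")
                    # Wrap sequence lines at a fixed width.
                    for start in range(0, len(seq), width):
                        out.write(seq[start:start + width] + "\n")
            paths.append(path)
    return paths
```

For example, `write_batches("genome.fa", "genome", 2_000_000_000)` would write parts of roughly 2 Gb each (the file name and limit are placeholders). One caveat with this approach: reads whose true origin lies in a different section may still align to the section at hand, so per-section results probably need to be interpreted with care.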

njaupan commented 3 years ago

Segemehl failed in the alignment, so running it section by section won't work either, but many thanks.

Best, panpan

crimBubble commented 3 years ago

The alignment likely failed because there was not enough available RAM (yes, even with ~500 GB), so running it in smaller parts should help. I think the RAM usage of segemehl might increase exponentially with genome size.

Best, Ludwig