crimBubble / ECCsplorer

The ECCsplorer is a bioinformatics pipeline for the automated detection of extrachromosomal circular DNA (eccDNA) from paired-end read data of amplified circular DNA.
GNU General Public License v3.0

Problems on idx file, BLAST+ database file and " No high (potential) confidants eccDNA candidate regions found" #16

Open hanssli opened 2 months ago

hanssli commented 2 months ago

@crimBubble @molbio-dresden I hope you are doing well. I am trying to use ECCsplorer to find eccDNA in my own data, but I get the message " No high (potential) confidants eccDNA candidate regions found". During the run I hit some errors and points of confusion:

  1. I am not sure whether the idx file is created by ECCsplorer or whether I need to place one in the "reference_data" folder myself. When I ran it the first time, it told me there was no idx file, so I downloaded one from Ensembl and renamed it to the name it asked for. I then got the errors "warning: index does not contain md5 key." and "[E::sam_hrecs_error] Malformed key:value pair at line 158: "@SQ " samtools view: failed to add PG line to the header", as in the following screenshot: Screenshot from 2024-04-20 09-33-49

  2. I am not sure what type of BLAST+ database file ECCsplorer needs. I specified one, but it also uses another one, dna_database_masked.fasta, as shown below: Screenshot from 2024-04-20 09-37-30 Is there anything wrong with my steps, or can I safely ignore it?

  3. I can get many eccDNA candidate regions with CircleMap but none from ECCsplorer, so I think something must be wrong, but I don't know what. Screenshot from 2024-04-20 09-44-54

hanssli commented 2 months ago

@molbio-dresden @crimBubble, I hope detailed instructions on these files can be provided.

Thanks and best wishes, Han

crimBubble commented 2 months ago

Hi @hanssli, thank you for trying out ECCsplorer.

  1. The index file for segemehl should be created by segemehl (which is part of our pipeline). It is normally created automatically when it is not found. You do not need to add it yourself; doing so might cause errors later on.

  2. The 'dna_database_masked.fasta' is the internal RepeatExplorer database which is always used for classification by default. Your additional database should be in fasta format. From what you posted it seems to work as expected.

  3. Not finding any candidates might be a result of the incompatible idx file you provided. Try again and let ECCsplorer (segemehl) create the index file. Depending on your genome this might take some time.

I hope this helps.

Additionally, as you seem to be working with human data, be aware that segemehl is very slow on large genomes and needs a lot of memory (RAM). You might need to run each chromosome separately. If you have control data you might just run the clustering module alone and try our new 2. circles assembly workflow. Further, I would expect you to find more candidates using CircleMap, as it uses a less conservative approach.

Best, LM

hanssli commented 2 months ago

Hi @crimBubble, thanks for your help. If the idx file is not created by segemehl, how can I solve the problem? My computer has 32 GB of RAM, and the run seems to have completed successfully.

Thanks and best wishes, Han

crimBubble commented 2 months ago

Hi @hanssli, if the idx file is not created automatically, you can try to create it manually using `segemehl.x -x <genome.fasta>.idx -d <genome.fasta>` and place it in the reference_data folder. Make sure to delete all previous results before you rerun; otherwise steps might get skipped when an existing output is found, even a faulty one.
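For anyone scripting this step, here is a minimal Python sketch of the manual indexing command above. It is not part of ECCsplorer. The function names `index_command` and `build_index` are made up for illustration, and the sketch assumes `segemehl.x` is on your PATH and that the index should sit next to the genome file as `<genome.fasta>.idx`.

```python
import shutil
import subprocess
from pathlib import Path


def index_command(genome_fasta):
    """Return the segemehl indexing command quoted above as an argument list."""
    genome = Path(genome_fasta)
    # Index path follows the <genome.fasta>.idx pattern from the command above.
    index = Path(str(genome) + ".idx")
    return ["segemehl.x", "-x", str(index), "-d", str(genome)]


def build_index(genome_fasta):
    """Run the indexing command and return the path of the index file."""
    if shutil.which("segemehl.x") is None:
        raise RuntimeError("segemehl.x was not found on PATH")
    cmd = index_command(genome_fasta)
    subprocess.run(cmd, check=True)  # raises CalledProcessError on failure
    index = Path(cmd[2])
    # A missing or empty index usually means the run did not really happen.
    if not index.is_file() or index.stat().st_size == 0:
        raise RuntimeError(f"index {index} is missing or empty")
    return index
```

Calling `build_index("reference_data/genome.fasta")` (the path is only an example) would run the same command as above and fail loudly if no usable index comes out.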

The segemehl documentation suggests at least 128GB of memory for mapping against the human genome iirc. So you need to split the genome into at least single chromosomes.
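Splitting the genome can be done with any FASTA tool. As a rough illustration only, a plain-Python sketch could look like the following. The function name `split_fasta` and the `<name>.fasta` output naming are assumptions for this example, not something ECCsplorer or segemehl prescribes.

```python
from pathlib import Path


def split_fasta(genome_fasta, out_dir):
    """Write each sequence of a multi-FASTA file to its own file.

    Returns the list of written paths in input order.
    """
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    written = []
    handle = None
    try:
        with open(genome_fasta) as src:
            for line in src:
                if line.startswith(">"):
                    if handle is not None:
                        handle.close()
                    # Sequence name = first whitespace-delimited header token.
                    fields = line[1:].split()
                    name = fields[0] if fields else f"seq{len(written) + 1}"
                    path = out / f"{name.replace('/', '_')}.fasta"
                    handle = open(path, "w")
                    written.append(path)
                # Lines before the first header (if any) are skipped.
                if handle is not None:
                    handle.write(line)
    finally:
        if handle is not None:
            handle.close()
    return written
```

Each resulting file can then be used as its own reference, which keeps the memory needed per segemehl run lower than for the whole genome at once.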

From the log you posted, segemehl starts at Sat Apr 20 08:55:48 with 10 threads and finishes in under 1 second. This indicates that it did not run at all, which is why you get no candidates.

Best, LM

hanssli commented 2 months ago

Hi @crimBubble, thanks for pointing out the problem. I had not noticed it before. That may be the cause. I will try splitting the genome into single chromosomes. I will also try the 'repeats_and_circles_assembly' tool later.

Have a nice day! Thanks and best wishes, Han