Open JunmingH opened 4 years ago
Dear @JunmingH,
would you have the complete command line that triggered this error?
Regarding the time constraints - two days seems pretty long. But the time required depends on data and command line used.
Cheers, Tobias
Hi Tobias,
Attached is the script. I ran it with around 300 samples. But there is no chance to run it completly.
/DCC-0.4.7/DCC/main.py @DCC_InputFiles/samplesheet_BM10 -T 2 -D -N -R /ref/GRCh38_Repeats_simpleRepeats_RepeatMasker.gtf -an /ref/gencode.v26.primary_assembly.annotation.gtf -F -M -Nr 1 1 -fg -G -A /ref/Homo_sapiens.GRCh38.dna.primary_assembly.fa -O /MSBB_rep/BM10 -B @DCC_InputFiles/bam_files_BM10 -t MSBB_rep/tmp_dir/BM10
Hi @JunmingH,
I don't think I ever ran DCC on such a large number of sample - therefore the merging might indeed take a lot of time.
However, you might tune some of the parameters to reduce the time:
-T
does not apply to the merging step, that part is single threaded, but you might want to increase the number of threads to speed up the rest of the computation
-Nr 1 1 is a very low threshold that will yield a very high number of false positives, i.e. its enough if a circle is only seen once with one BSJ read in one sample. I would recommend something like --Nr 2,5
or even -Nr 5,10
- meaning 2 reads in at least 5 samples or 5 reads in at least 10. THis also should speed up the merging.
The memory error is probably related to the extremely huge spare matrix that contains mostly zeros holding circRNA counts
Let me know if the new parameters can address your issue.
Cheers, Tobias
Hi Tobias,
One of my scripts is successful. It used 400 Gb memory for filter function. But it still stuck at a place.
I am not sure which part it is:
The error message is : "[E::faidx_adjust_position] The sequence "chr1" not found [E::faidx_adjust_position] The sequence "chr1" not found [E::faidx_adjust_position] The sequence "chr1" not found [E::faidx_adjust_position] The sequence "chr1" not found [E::faidx_adjust_position] The sequence "chr1" not found ........." Have you ever see this before?
BTW very appreciate for your help!
Dear @JunmingH,
You may want to create a new index for your genome fasta file.
Try samtools faidx genome.fasta
in the folder of the genome file. Maybe the index is corrupted or incomplete.
Cheers, Tobias
Do you mean -A /ref/Homo_sapiens.GRCh38.dna.primary_assembly.fa this file?
Yes, exactly.
Hi Tobias,
Is this warning normal?, I got this when doing linear counting
2020-03-24 09:52:35,105 WARNING: circRNA start position ('chr18', '45196357') does not have mapped read counts, treated as 0
Hi, @tjakobi , I'm stuck with the same part. My error information is as follows:
Filtering by read counts
Count CircSkip junctions
Traceback (most recent call last):
File "/usr/bin/DCC", line 11, in <module>
load_entry_point('DCC==0.4.7', 'console_scripts', 'DCC')()
File "build/bdist.linux-x86_64/egg/DCC/main.py", line 472, in main
File "build/bdist.linux-x86_64/egg/DCC/main.py", line 669, in findCircSkipJunction
File "build/bdist.linux-x86_64/egg/DCC/Circ_nonCirc_Exon_Match.py", line 281, in findcircAdjacent
File "build/bdist.linux-x86_64/egg/DCC/Circ_nonCirc_Exon_Match.py", line 224, in getAdjacent
ValueError: invalid literal for int() with base 10: '1_3182'
started circRNA detection from file
I have no idea that whether if my gtf annotation, which was downloaded from NCBI, is not right, my gtf:
##gff-version 3
##source-version rtracklayer 1.46.0
##date 2020-04-13
NC_003977.2 RefSeq region 1 3182 . + . ID=NC_003977.2:1_3182;Dbxref=taxon:10407;Is_circular=true;gbkey=Src;genome=genomic;mol_type=genomic DNA;strain=ayw;
NC_003977.2 RefSeq gene 1376 1840 . + . ID=gene-HBVgp3;Dbxref=GeneID:944566;gbkey=Gene;Name=X;gene=X;gene_biotype=protein_coding;locus_tag=HBVgp3;
NC_003977.2 RefSeq CDS 1376 1840 . + 0 ID=cds-YP_009173867.1;Dbxref=Genbank:YP_009173867.1,GeneID:944566;gbkey=CDS;Name=YP_009173867.1;gene=X;locus_tag=HBVgp3;Parent=gene-HBVgp3;product=X protein;protein_id=YP_009173867.1;
NC_003977.2 RefSeq regulatory_region 1592 1602 . + . ID=id-NC_003977.2:1592_1602;gbkey=regulatory;Note=direct repeat sequence 2 (DR2)%3b cis-acting sequence which participates in synthesis of minus-strand DNA
NC_003977.2 RefSeq regulatory_region 1775 1794 . + . ID=id-NC_003977.2:1775_1794;gbkey=regulatory;Note=phi%3b cis-acting sequence which participates in synthesis of minus-strand DNA
NC_003977.2 RefSeq sequence_difference 1775 1775 . - . ID=id-NC_003977.2:1775_1775;gbkey=misc_difference;Note=G is A in ref[2]%3b~conflict
NC_003977.2 RefSeq gene 1816 2454 . + . ID=gene-HBVgp4;Dbxref=GeneID:944568;gbkey=Gene;Name=C;gene=C;gene_biotype=protein_coding;locus_tag=HBVgp4;
But CircCoordinates and CircCount were still generated. Two separate chunks not affect each other ?
Thanks for your time.
Try ucsc annotation file. This one did not have chromosome number. Junming
@JunmingH Thanks. It's a genome of Hepatitis B virus and there is no information in UCSC.
Hi, sorry for the delayed response.
There might be an issue with the format of the 9th column, and DCC running into problems gathering data from it. DCC normally uses GTF files and you are providing a GFF3 file, which might result in problems. You might want to try using a GTF formatted file just to rule this possibility out. See https://www.biostars.org/p/99462/ for details on the differences.
Thanks, it is indeed a problem of GTF.
Hi,
I am trying to run DCC but always stuck at circFilter.py The error message is below. Traceback (most recent call last): File "/circu_RNA/DCC-0.4.8/DCC/main.py", line 842, in
main()
File "/circu_RNA/DCC-0.4.8/DCC/main.py", line 375, in main
filt.filter_nonrep(rep_file, indx0, count0)
File "/circu_RNA/DCC-0.4.8/DCC/circFilter.py", line 110, in filter_nonrep
nonrep = np.column_stack((indx0, count0))
File "/share/pkg.7/python2/2.7.16/install/lib/python2.7/site-packages/numpy/lib/shape_base.py", line 640, in column_stack
return _nx.concatenate(arrays, 1)
MemoryError
Could you please give me some ideas about this?
BTW is that possible it spends two days on combining those results together? Since each time it takes a very long time to combine, but still, have a memory error. I only use 4 threads and request 252 GB.
Best