JuiTse opened this issue 2 years ago
Hi @alexdobin, thanks for your help; it was indeed my ignorance of the input format.
The program ran fine after I converted the annotation file to GTF, but now it gets killed unexpectedly without any error message. Could you help me figure out what is wrong? Best regards, Jui-Tse
Log.zip
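P.S. In case it helps with the silent kill, one thing I could check myself is whether the kernel's OOM killer terminated the job (this is only my guess that memory is the cause, not something confirmed here):

```bash
# Look for OOM-killer messages around the time STAR died (may need root on some systems).
dmesg -T | grep -iE 'out of memory|killed process' | tail
```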
Dear @alexdobin, I have retried the command with --limitGenomeGenerateRAM and successfully finished all SA indices (so the previous question can probably be skipped). However, the program aborted while packing the SA:

```
Mar 08 12:32:26 ..... started STAR run
Mar 08 12:32:26 ... starting to generate Genome files
Mar 08 12:43:39 ... starting to sort Suffix Array. This may take a long time...
Mar 08 12:46:31 ... sorting Suffix Array chunks and saving them to disk...
Mar 08 17:43:18 ... loading chunks from disk, packing SA...
terminate called after throwing an instance of 'std::bad_alloc'
  what():  std::bad_alloc
Aborted (core dumped)
```
The following is my command, with the log file attached:

```bash
STAR --runThreadN 12 --limitGenomeGenerateRAM 96000000000 --runMode genomeGenerate --genomeDir ./index --genomeFastaFiles ./P_tabuliformis.fa --sjdbGTFfile ./P_tabuliformis.gtf --sjdbOverhang 149
```

Should I set --genomeChrBinNbits or something else to solve this? If genomeChrBinNbits needs to be set, should the value be log2(26435649536/7370) or log2(22382379008/12), based on my reference genome?
Log.zip
Best regards, Jui-Tse
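P.S. For reference, here is my rough arithmetic for the two candidates, assuming the scaling suggested in the STAR manual, min(18, log2(GenomeLength/NumberOfReferences)), is the one that applies here (this is just my own calculation, not a verified recommendation):

```bash
# Quick arithmetic sketch for the two candidate genomeChrBinNbits values.
awk 'BEGIN {
    a = log(26435649536 / 7370) / log(2);   # GenomeLength / NumberOfReferences
    b = log(22382379008 / 12)   / log(2);   # the other ratio from my question
    printf "log2(26435649536/7370) = %.1f\n", a;
    printf "log2(22382379008/12)   = %.1f\n", b;
    printf "min(18, %.1f)           = %d\n", a, (a < 18 ? a : 18);
}'
```

If that scaling is the right one, the first ratio comes out around 21.8, so min(18, ...) would leave the parameter at its default of 18 anyway.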
Hi Jui-Tse,
this is a huge genome, ~26 Gbases. If you have 96 GB of RAM, you would need --genomeSAsparseD 3 to fit this genome into memory.
Best, Alex
Hi @alexdobin, I have tried --genomeSAsparseD 3, but it failed again. I then increased it to 6, and the run almost succeeded, but finally failed while processing the annotations GTF. The following is my command:

```bash
STAR --runThreadN 12 --limitGenomeGenerateRAM 100000000000 --genomeSAsparseD 6 --runMode genomeGenerate --genomeDir ./index --genomeFastaFiles ./P_tabuliformis.fa --sjdbGTFfile ./P_tabuliformis.gtf --sjdbOverhang 149
```
The following is the error message, with the log file attached:

```
Mar 09 10:47:21 ..... started STAR run
Mar 09 10:47:21 ... starting to generate Genome files
Mar 09 10:58:09 ... starting to sort Suffix Array. This may take a long time...
Mar 09 10:59:16 ... sorting Suffix Array chunks and saving them to disk...
Mar 09 11:40:41 ... loading chunks from disk, packing SA...
Mar 09 11:55:56 ... finished generating suffix array
Mar 09 11:55:56 ... generating Suffix Array index
Mar 09 12:04:17 ... completed Suffix Array index
Mar 09 12:04:17 ..... processing annotations GTF
Segmentation fault (core dumped)
```

Log.zip
I now have no idea what is wrong with it. I hope you can give me some advice; thanks a lot. Best regards, Jui-Tse
Hi @alexdobin, sorry to keep bothering you. I have switched to another version, 2.7.9a, and it finally ran successfully, but with some warnings:

```
!!!!! WARNING: while processing sjdbGTFfile=./P_tabuliformis.gtf, line:
chr4  transcriptome  exon  -2103759131  -2103758793  .  -  .  transcript_id "Pt4G66550.7"; gene_id "Pt4G66550"; gene_name "Pt4G66550";
exon end = 18446744071605792823 is larger than the chromosome chr4 length = 2192534405 , will skip this exon
```
Is it safe to ignore these warnings, or should I fix something? Log.zip
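For reference, I notice the huge exon end in that warning is exactly 2^64 − 2103758793 = 18446744071605792823, i.e. the negative end coordinate from that line read back as an unsigned 64-bit integer, so the underlying records seem to have negative coordinates. A quick way I could list such records (a sketch, assuming a standard tab-separated GTF with start in column 4 and end in column 5):

```bash
# List GTF records whose start or end coordinate is non-positive
# (the candidates behind the "exon end is larger than the chromosome length" warnings).
awk -F'\t' '!/^#/ && ($4 < 1 || $5 < 1)' P_tabuliformis.gtf | head
```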
Best, Jui-Tse
Hi, I keep getting stuck at different steps. To summarize what I have encountered:
a. No gene_id:

```
WARNING: while processing pGe.sjdbGTFfile=./P_tabuliformis.gtf: no gene_id for line:
chr6  transcriptome  exon  701089437  701089571  .  -  .  transcript_id "Pt6G22210.2";
```
b. Exon end larger than the chromosome length:

```
!!!!! WARNING: while processing sjdbGTFfile=./P_tabuliformis.gtf, line:
chr4  maker  exon  -2114491853  -2114491661  .  -  .  transcript_id "Pt4G66050.1"; gene_id "Pt4G66050"; gene_name "Pt4G66050";
exon end = 18446744071595059955 is larger than the chromosome chr4 length = 2192534405 , will skip this exon
```
c. Long junction repeat:

```
WARNING: long repeat for junction # 204384 : tig00002302_1_2 500782 505631; left shift = 255; right shift = 255
```
Even when the STAR indexing and mapping finish successfully, errors keep appearing downstream in Picard or samtools.
The Picard error:

```
Exception in thread "main" java.lang.NumberFormatException: For input string: "2364278061"
	at java.lang.NumberFormatException.forInputString(NumberFormatException.java:65)
	at java.lang.Integer.parseInt(Integer.java:583)
	at java.lang.Integer.parseInt(Integer.java:615)
	at htsjdk.samtools.SAMTextHeaderCodec.parseSQLine(SAMTextHeaderCodec.java:214)
	at htsjdk.samtools.SAMTextHeaderCodec.decode(SAMTextHeaderCodec.java:113)
	at htsjdk.samtools.BAMFileReader.readHeader(BAMFileReader.java:704)
	at htsjdk.samtools.BAMFileReader.<init>(BAMFileReader.java:298)
	at htsjdk.samtools.BAMFileReader.<init>(BAMFileReader.java:176)
	at htsjdk.samtools.SamReaderFactory$SamReaderFactoryImpl.open(SamReaderFactory.java:406)
	at picard.sam.markduplicates.util.AbstractMarkDuplicatesCommandLineProgram.openInputs(AbstractMarkDuplicatesCommandLineProgram.java:265)
	at picard.sam.markduplicates.MarkDuplicates.buildSortedReadEndLists(MarkDuplicates.java:507)
	at picard.sam.markduplicates.MarkDuplicates.doWork(MarkDuplicates.java:258)
	at picard.cmdline.CommandLineProgram.instanceMain(CommandLineProgram.java:308)
	at picard.cmdline.PicardCommandLine.instanceMain(PicardCommandLine.java:103)
	at picard.cmdline.PicardCommandLine.main(PicardCommandLine.java:113)
```
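My reading of this stack trace (an assumption on my part, not confirmed anywhere in this thread) is that htsjdk parses the LN field of each @SQ header line with Integer.parseInt, and the chromosome length 2364278061 is above the 32-bit signed maximum of 2147483647, so MarkDuplicates fails before it even looks at the reads. A rough way to see which sequences exceed that limit (a sketch; the BAM file name is only my guess from the --outFileNamePrefix in the mapping command below):

```bash
# Print @SQ header lines whose LN (sequence length) exceeds the 32-bit signed integer maximum.
samtools view -H ./mapped_ind/T733-02-T89Aligned.sortedByCoord.out.bam \
  | awk -F'\t' '$1 == "@SQ" {
      for (i = 2; i <= NF; i++)
        if ($i ~ /^LN:/ && substr($i, 4) + 0 > 2147483647) { print; break }
    }'
```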
All the trials above use this command for indexing:

```bash
STAR --runThreadN 8 --limitGenomeGenerateRAM 97000000000 --genomeSAsparseD 8 --genomeChrBinNbits 15 --runMode genomeGenerate --genomeDir ./index_P.tabuliformis/ --genomeFastaFiles ./P.tabuliformis_V1.0_new.fa --sjdbGTFfile ./P_tabuliformis.gtf --sjdbOverhang 149
```
and this one for mapping (only run under 2.7.9a, because 2.6.1a and 2.7.2 got stuck during indexing):

```bash
STAR --genomeDir ./index_P.tabuliformis/ --runThreadN 12 --readFilesIn ../fasta_after_trimmomatic/Pine_T733-02-T89_good_trimed_1_paired.fq.gz ../fasta_after_trimmomatic/Pine_T733-02-T89_good_trimed_2_paired.fq.gz --readFilesCommand zcat --outFileNamePrefix ./mapped_ind/T733-02-T89 --outSAMtype BAM SortedByCoordinate --outSAMattributes Standard
```
I have used exactly the same commands on a different species and got successful results, even in Picard. So I am wondering whether the problem lies in the GTF or in the raw genome FASTA file of this species.
Thanks for helping me, Jui-Tse
Log2.6.1e.zip Log2.7.9a.zip Log2.7.2.zip P_tabuliformis_gtf.zip P_tabuliformis_gff3.zip
Hi Jui-Tse,
it looks like the GTF file has exons with negative coordinates, which is not allowed. You need to fix that before proceeding.
Best, Alex
Hi @alexdobin, sorry, I am not very familiar with how GTF/GFF files are produced; I just downloaded this one from a database.
Could the error have been introduced during the conversion from GFF3 to GTF? (I used this command to convert: `gffread -T P_tabuliformis_V2.gff3 -o P_tabuliformis.gtf`.) Or might it already exist in the original GFF3?
Could you give me some hints on how to fix it? Thank you for spending time on this.
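In the meantime, maybe I can check both files for negative coordinates myself, for example with something like this (a rough sketch, assuming start and end are in columns 4 and 5 of both formats):

```bash
# Count records with a non-positive start or end in the original GFF3 and in the
# converted GTF, to see where the negative coordinates first appear.
for f in P_tabuliformis_V2.gff3 P_tabuliformis.gtf; do
  printf '%s\t' "$f"
  awk -F'\t' '!/^#/ && ($4 < 1 || $5 < 1)' "$f" | wc -l
done
```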
Best, Jui-Tse
Hi Jui-Tse,
the coordinates are not supposed to change in the GFF3->GTF transformation, so I suspect the problem is in the original GFF3 file.
Hi @alexdobin, I will first try simply removing the annotation records with negative coordinates (if there are not too many of them). If that still does not work, I will contact the authors. I will post an update if there is anything new. Thanks for your kind responses and help :)
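P.S. The removal I have in mind would be roughly this (just a sketch, not yet tested on the real file): keep comment lines and records whose coordinates are positive and properly ordered, and drop everything else.

```bash
# Write a cleaned GTF without the records that have negative or reversed coordinates.
awk -F'\t' '/^#/ || ($4 >= 1 && $5 >= $4)' P_tabuliformis.gtf > P_tabuliformis.clean.gtf
```

(The transcripts those exons belong to would lose some exons, so this is only a workaround until the annotation itself is fixed.)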
Hi, I have recently been mapping a transcriptome onto a large genome of about 25 Gb. The following command ran successfully, but it produced empty index files sjdbList.fromGTF.out.tab and sjdbList.out.tab, which later causes the Picard MarkDuplicates error.
```bash
STAR --runThreadN 12 --limitGenomeGenerateRAM 100000000000 --runMode genomeGenerate --genomeSAsparseD 5 --genomeSAindexNbases 10 --genomeDir ./index --genomeFastaFiles ./P_tabuliformis.fa --sjdbGTFfile ./P_tabuliformis_V2.gff3 --sjdbOverhang 149
```
I have used the same command on another genome smaller than 1 Gb and it worked (though without --limitGenomeGenerateRAM 100000000000, --genomeSAsparseD 5, and --genomeSAindexNbases 10, which are mostly relevant for large genomes).
So I am wondering whether the large genome size and insufficient memory caused this, and how to fix it. According to `free -g`, there seem to be about 100 GB of memory available:

```
              total        used        free      shared  buff/cache   available
Mem:            110           7          85           0          17         101
Swap:             1           1           0
```
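One other thing I am unsure about (just my guess from the STAR manual, not something verified here): since this run passes the GFF3 directly as --sjdbGTFfile, STAR may need to be told which attribute names an exon's parent transcript; otherwise it may extract no junctions at all, which would leave sjdbList.fromGTF.out.tab empty. The command would then look roughly like:

```bash
# Same genomeGenerate command as above, plus the GFF3 exon-parent attribute
# (my assumption: without it, STAR's default transcript_id tag matches nothing in a GFF3).
STAR --runThreadN 12 --limitGenomeGenerateRAM 100000000000 --runMode genomeGenerate \
     --genomeSAsparseD 5 --genomeSAindexNbases 10 \
     --genomeDir ./index --genomeFastaFiles ./P_tabuliformis.fa \
     --sjdbGTFfile ./P_tabuliformis_V2.gff3 \
     --sjdbGTFtagExonParentTranscript Parent \
     --sjdbOverhang 149
```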
Log_.zip
Thanks for the help!