alexdobin / STAR

RNA-seq aligner
MIT License
1.85k stars 506 forks source link

Empty content for the output files sjdbList.fromGTF.out.tab and sjdbList.out.tab after indexing on large genome size #1492

Open JuiTse opened 2 years ago

JuiTse commented 2 years ago

Hi, I am recently mapping the transcriptome on large genome size with about 25G. I have run the following code successfully, but get the empty indexing files: sjdbList.fromGTF.out.tab and sjdbList.out.tab These would make the latter picard MarkDuplicate error.

STAR --runThreadN 12 --limitGenomeGenerateRAM 100000000000 --runMode genomeGenerate --genomeSAsparseD 5 --genomeSAindexNbases 10 --genomeDir ./index --genomeFastaFiles ./P_tabuliformis.fa --sjdbGTFfile ./P_tabuliformis_V2.gff3 --sjdbOverhang 149

I have use the same code on another smaller genome size within 1G and performed successfully (but without --limitGenomeGenerateRAM 100000000000, --genomeSAsparseD 5, and--genomeSAindexNbases 10, which were mostly suitable for large genome size).

So I am wondering whether it is the large genome size and insufficient memory that cause this happened? How to fix it? ( By free-g, there seems to be 100G memory available, total used free shared buff/cache available Mem: 110 7 85 0 17 101 Swap: 1 1 0)

Log_.zip

Thank for the help!

alexdobin commented 2 years ago

Hi @JuiTse

you would need to convert GFF3 file to GTF, using, e.g., gffread.

Cheers Alex

JuiTse commented 2 years ago

Hi @alexdobin , thank for your help, it is my ignorance of the input format.

The program run fine after I convert annotation file to gtf, but it will be accidentally killed without any error message. Could you help me to figure out what is wrong with it? Best Regards, Jui-Tse Log.zip

JuiTse commented 2 years ago

Dear @alexdobin , I have retried the command with limitGenomeGenerateRAM, and successfully finished all SA indices. (so maybe the previous question could be skipped) However, the program aborted when packing SA, Mar 08 12:32:26 ..... started STAR run Mar 08 12:32:26 ... starting to generate Genome files Mar 08 12:43:39 ... starting to sort Suffix Array. This may take a long time... Mar 08 12:46:31 ... sorting Suffix Array chunks and saving them to disk... Mar 08 17:43:18 ... loading chunks from disk, packing SA... terminate called after throwing an instance of 'std::bad_alloc' what(): std::bad_alloc Aborted (core dumped)

The following is my code with attachment of log file, STAR --runThreadN 12 --limitGenomeGenerateRAM 96000000000 --runMode genomeGenerate --genomeDir ./index --genomeFastaFiles ./P_tabuliformis.fa --sjdbGTFfile ./P_tabuliformis.gtf --sjdbOverhang 149 Should I set genomeChrBinNbits or something else to solve it? If the genomeChrBinNbits should be set, should the number be: log2(26435649536/7370) or log2(22382379008/12) based on my reference genome Log.zip ? Best regards, Jui-Tse

alexdobin commented 2 years ago

Hi Jui-Tse,

this is a huge genome, ~26Gbases. If you have 96GB of RAM, you would need --genomeSAsparseD 3 to fit this genome into memory.

Best, Alex

JuiTse commented 2 years ago

Hi @alexdobin , I have tried --genomeSAsparseD 3, but failed again. Then I tried to increase it to 6, and it run almost successfully, but finally failed when processing annotations GTF. The following is my code: STAR --runThreadN 12 --limitGenomeGenerateRAM 100000000000 --genomeSAsparseD 6 --runMode genomeGenerate --genomeDir ./index --genomeFastaFiles ./P_tabuliformis.fa --sjdbGTFfile ./P_tabuliformis.gtf --sjdbOverhang 149

The following is the error message and log file: Mar 09 10:47:21 ..... started STAR run Mar 09 10:47:21 ... starting to generate Genome files Mar 09 10:58:09 ... starting to sort Suffix Array. This may take a long time... Mar 09 10:59:16 ... sorting Suffix Array chunks and saving them to disk... Mar 09 11:40:41 ... loading chunks from disk, packing SA... Mar 09 11:55:56 ... finished generating suffix array Mar 09 11:55:56 ... generating Suffix Array index Mar 09 12:04:17 ... completed Suffix Array index Mar 09 12:04:17 ..... processing annotations GTF Segmentation fault (core dumped Log.zip )

I now have no idea what is wrong with it. Hope you can give some advises, really thanks. Best regards, Jui-Tse

JuiTse commented 2 years ago

Hi, @alexdobin sorry for keep bothering, I have changed to use another version 2.7.9a, and finally run successfully, but with some warnings, !!!!! WARNING: while processing sjdbGTFfile=./P_tabuliformis.gtf, line: chr4 transcriptome exon -2103759131 -2103758793 . - . transcript_id "Pt4G66550.7"; gene_id "Pt4G66550"; gene_name "Pt4G66550"; exon end = 18446744071605792823 is larger than the chromosome chr4 length = 2192534405 , will skip this exon

Are these warnings OK to be skipped? Or I should fix something? Log.zip

Best, Jui-Tse

JuiTse commented 2 years ago

Hi, I have kept stuck in different processes. Just summarize what I have encountered,

  1. When using 2.7.9a, the software will run completely in indexing and mapping but with three kinds of warnings in indexing:

a. No gene_id: WARNING: while processing pGe.sjdbGTFfile=./P_tabuliformis.gtf: no gene_id for line: chr6 transcriptome exon 701089437 701089571 . - . transcript_id "Pt6G22210.2";

b. Exon length longer than chromosome length: !!!!! WARNING: while processing sjdbGTFfile=./P_tabuliformis.gtf, line: chr4 maker exon -2114491853 -2114491661 . - . transcript_id "Pt4G66050.1"; gene_id "Pt4G66050"; gene_name "Pt4G66050"; exon end = 18446744071595059955 is larger than the chromosome chr4 length = 2192534405 , will skip this exon

c. Long junction repeat: WARNING: long repeat for junction # 204384 : tig00002302_1_2 500782 505631; left shift = 255; right shift = 255

Even if the indexing and mapping in STAR run successfully, the error will keep appeared in picard or samtools.

For picard error: Exception in thread "main" java.lang.NumberFormatException: For input string: "2364278061" at java.lang.NumberFormatException.forInputString(NumberFormatException.java:65) at java.lang.Integer.parseInt(Integer.java:583) at java.lang.Integer.parseInt(Integer.java:615) at htsjdk.samtools.SAMTextHeaderCodec.parseSQLine(SAMTextHeaderCodec.java:214) at htsjdk.samtools.SAMTextHeaderCodec.decode(SAMTextHeaderCodec.java:113) at htsjdk.samtools.BAMFileReader.readHeader(BAMFileReader.java:704) at htsjdk.samtools.BAMFileReader.(BAMFileReader.java:298) at htsjdk.samtools.BAMFileReader.(BAMFileReader.java:176) at htsjdk.samtools.SamReaderFactory$SamReaderFactoryImpl.open(SamReaderFactory.java:406) at picard.sam.markduplicates.util.AbstractMarkDuplicatesCommandLineProgram.openInputs(AbstractMarkDuplicatesCommandLineProgram.java:265) at picard.sam.markduplicates.MarkDuplicates.buildSortedReadEndLists(MarkDuplicates.java:507) at picard.sam.markduplicates.MarkDuplicates.doWork(MarkDuplicates.java:258) at picard.cmdline.CommandLineProgram.instanceMain(CommandLineProgram.java:308) at picard.cmdline.PicardCommandLine.instanceMain(PicardCommandLine.java:103) at picard.cmdline.PicardCommandLine.main(PicardCommandLine.java:113)

p.s. The string "2364278061", is the first chromosome length based on chrLength.txt file output from STAR.

  1. When using 2.6.1e or 2.7.2, the program will just stop with segmentation error (core dumped) in indexing process, and with "no gene id warnings". I have tried not using runThread, or with different threads (e.g., 6,8, and 10), but still cannot solve the problems.

All the trials above follow this code: for indexing: STAR --runThreadN 8 --limitGenomeGenerateRAM 97000000000 --genomeSAsparseD 8 --genomeChrBinNbits 15 --runMode genomeGenerate --genomeDir ./index_P.tabuliformis/ --genomeFastaFiles ./P.tabuliformis_V1.0_new.fa --sjdbGTFfile ./P_tabuliformis.gtf --sjdbOverhang 149

for mapping: (only execute under2.7.9a, because 2.6.1a and 2.7.2 get stucked when undexing) STAR --genomeDir ./index_P.tabuliformis/ --runThreadN 12 --readFilesIn ../fasta_after_trimmomatic/Pine_T733-02-T89_good_trimed_1_paired.fq.gz ../fasta_after_trimmomatic/Pine_T733-02-T89_good_trimed_2_paired.fq.gz --readFilesCommand zcat --outFileNamePrefix ./mapped_ind/T733-02-T89 --outSAMtype BAM SortedByCoordinate --outSAMattributes Standard

I have used completely same code but by different species, and get the successful result even in picard. I am wondering whether this is the error in gtf or raw genome fasta files of this species.

Thank for helping me, Jui-Tse Log2.6.1e.zip Log2.7.9a.zip Log2.7.2.zip P_tabuliformis_gtf.zip P_tabuliformis_gff3.zip ?

alexdobin commented 2 years ago

Hi Jui-Tse,

it looks like the GTF file has exons with negative coordinates which is not allowed. You need to fix it before proceeding.

Best, Alex

JuiTse commented 2 years ago

Hi, @alexdobin sorry I am not quite familiar with the production of GTF/GFF file, I just download from database.

Is the error derived from the process of format conversion from GFF3 to GTF? (I use this command to convert: gffread -T P_tabuliformis_V2.gff3 -o P_tabuliformis.gtf) Or it may just exist in the original GFF3?

Could you give me some hints to fix it? Thank you for spending time on this.

Best, Jui-Tse

alexdobin commented 2 years ago

Hi Jui-Tse,

the coordinates are not supposed to change in the GFF3->GTF transformation, so I suspect the problem is in the original GFF3 file.

JuiTse commented 2 years ago

Hi, @alexdobin I will try first to just remove the annotation sites with negative coordinates first (if there are not a lot). If it still not work, then I will further contact the author. If there is anything new, I will update it. Thank for your kind responses and help : )