alexdobin / STAR

RNA-seq aligner
MIT License

FATAL ERROR in input read file: the total length of barcode sequence is 151 not equal to expected 28, and also not mapped reads #2202

Open lopezCascales opened 1 month ago

lopezCascales commented 1 month ago

Hi Alex, I have some issues with data that I received from collaborators. I have normally worked with Smart-seq plate-based protocols, but these samples are 10x, and I don't have much information about them. I have followed the workflows from some other issues, but I don't understand why I can't solve the problem. I don't know if I'm doing this correctly, so I'm going to copy, more or less, the steps that I'm following. From the FASTQ files that I have, I can see the sequencer-specific instrument ID @Kxxxx (HiSeq 3000(?)/4000) and a flowcell ID ending in BBXX (a HiSeq 3000/4000 run?).

Example for one sample:

gunzip *.fastq.gz

cat file1.fastq file2.fastq > bigfile.fastq

cat file.fastq | head -n40

@K00360:651:HHKHYBBXY:1:1101:3640:1086 1:N:0:NCTCGTTT

NCTCGTTT

+

#####################################################

The only information I have about the sample:

hashtag_oligo   | well10X | RunID
TGTCTTTCCTGCCAG | 3       | 20200811-SCS47-2

Given that, I assumed that I need to use the 3M-february-2018.txt whitelist for the barcodes. I built a genome index for human of 100 (I don't know if that is enough).

#####################################################

Based on different answers on forums, I used this command for STAR:

STAR --genomeDir ./indexHuman100 \
  --readFilesIn 20200811-SCS47-2-HT_S4_R2.fastq 20200811-SCS47-2-HT_S4_R1.fastq \
  --outFileNamePrefix scRNA20200811-SCS47-2-HT \
  --outFilterType BySJout \
  --outFilterMultimapNmax 20 \
  --alignIntronMax 100000 \
  --outFilterMismatchNmax 4 \
  --outFilterMatchNminOverLread 0.3 \
  --outFilterScoreMinOverLread 0.3 \
  --outFilterScoreMin 30 \
  --alignEndsType Local \
  --soloType CB_UMI_Simple \
  --soloCBstart 1 --soloCBlen 16 \
  --soloUMIstart 17 --soloUMIlen 12 \
  --soloCBmatchWLtype 1MM_multi_Nbase_pseudocounts \
  --soloUMIfiltering MultiGeneUMI_CR \
  --soloUMIdedup 1MM_CR \
  --runThreadN 128 \
  --clipAdapterType CellRanger4 \
  --outSAMtype BAM SortedByCoordinate \
  --outSAMattributes CR UR CY UY CB UB NH HI GX GN \
  --soloFeatures Gene \
  --soloCBwhitelist 3M-february-2018.txt

EXITING because of FATAL ERROR in input read file: the total length of barcode sequence is 151 not equal to expected 28
Read ID=@K00360:651:HHKHYBBXY:1:1101:3640:1086 ; Sequence=NGGTACATCGGTAATTCCCTTTCGAGGTTTGCTAGGACCGGCNGTANAGNCCGANGGCTNNACATCTGGCAACCGNANTTCATNANANCNGAAGAGNANACGNCTGAACTCCAGTCACTCTCGTTTATCTCGTATGCCGTCTTCTGCTTGA
SOLUTION: check the formatting of input read files.
If UMI+CB length is not equal to the barcode read length, specify barcode read length with --soloBarcodeReadLength
To avoid checking of barcode read length, specify --soloBarcodeReadLength 0
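The error says that the file STAR reads as the barcode read contains 151-base reads, while --soloCBlen 16 plus --soloUMIlen 12 add up to 28. One way to see which FASTQ file holds which read is to tabulate read lengths over the first few thousand records of each file. The following is only a sketch in Python: the helper names `read_length_counts` and `open_fastq` are made up for illustration, it assumes plain four-line FASTQ records, and the demo records are toy data (the first sequence is the one shown earlier in this post, the quality strings are invented).

```python
import gzip
import io
from collections import Counter
from itertools import islice


def read_length_counts(handle, max_reads=10000):
    """Count sequence lengths over the first max_reads FASTQ records.

    Assumes four lines per record, so the sequence is every 4th line
    starting from the 2nd line of the file.
    """
    counts = Counter()
    for line in islice(handle, 1, max_reads * 4, 4):
        counts[len(line.strip())] += 1
    return counts


def open_fastq(path):
    """Open a plain or gzip-compressed FASTQ file as text (hypothetical helper)."""
    return gzip.open(path, "rt") if path.endswith(".gz") else open(path)


# Toy in-memory input; quality strings are invented for the demo.
demo = io.StringIO(
    "@r1\nNCTCGTTT\n+\n#AAFFJJJ\n"
    "@r2\nACGTACGT\n+\nAAFFJJJJ\n"
)
demo_counts = read_length_counts(demo)
print(demo_counts)

# On real data (file name taken from the command in this post):
# with open_fastq("20200811-SCS47-2-HT_S4_R1.fastq") as fh:
#     print(read_length_counts(fh))
```

Running this on each file of the sample shows the read length in each one. STARsolo expects the cDNA read first and the barcode read second in --readFilesIn, so the lengths reported for the second file are the ones that matter for --soloBarcodeReadLength.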

########################################################

I then tried these two options:

--soloBarcodeReadLength 150
--soloBarcodeReadLength 151

The first one did not work; the second one worked:

Aug 22 18:19:57 ..... started STAR run
Aug 22 18:19:58 ..... loading genome
Aug 22 18:20:42 ..... started mapping
Aug 22 18:26:19 ..... finished mapping
Aug 22 18:26:20 ..... started Solo counting
Aug 22 18:26:36 ..... finished Solo counting
Aug 22 18:26:36 ..... started sorting BAM
Aug 22 18:26:39 ..... finished successfully

########################################################

But this is the Log.final.out file:

                             Started job on |   Aug 22 18:19:57
                         Started mapping on |   Aug 22 18:20:42
                                Finished on |   Aug 22 18:26:39
   Mapping speed, Million of reads per hour |   513.01

                      Number of input reads |   50873627
                  Average input read length |   150
                                UNIQUE READS:
               Uniquely mapped reads number |   148635
                    Uniquely mapped reads % |   0.29%
                      Average mapped length |   116.95
                   Number of splices: Total |   26365
        Number of splices: Annotated (sjdb) |   26216
                   Number of splices: GT/AG |   26276
                   Number of splices: GC/AG |   75
                   Number of splices: AT/AC |   9
           Number of splices: Non-canonical |   5
                  Mismatch rate per base, % |   1.93%
                     Deletion rate per base |   0.01%
                    Deletion average length |   1.77
                    Insertion rate per base |   0.02%
                   Insertion average length |   1.62
                         MULTI-MAPPING READS:
    Number of reads mapped to multiple loci |   19370
         % of reads mapped to multiple loci |   0.04%
    Number of reads mapped to too many loci |   77
         % of reads mapped to too many loci |   0.00%
                              UNMAPPED READS:

  Number of reads unmapped: too many mismatches |   0
       % of reads unmapped: too many mismatches |   0.00%
            Number of reads unmapped: too short |   50691944
                 % of reads unmapped: too short |   99.64%
                Number of reads unmapped: other |   13601
                     % of reads unmapped: other |   0.03%
                                CHIMERIC READS:
                       Number of chimeric reads |   0
                            % of chimeric reads |   0.00%
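To compare runs quickly, the `name | value` lines of this log can be read into a dictionary and the percentages recomputed from the raw counts. This is a minimal sketch, not part of STAR: `parse_star_log` is a made-up helper, and the excerpt below copies only a few lines from the log above.

```python
def parse_star_log(text):
    """Parse 'name | value' lines from a STAR Log.final.out into a dict."""
    stats = {}
    for line in text.splitlines():
        if "|" not in line:
            continue  # skip blank lines and section headers such as 'UNIQUE READS:'
        key, value = line.split("|", 1)
        stats[key.strip()] = value.strip()
    return stats


# Lines copied from the log quoted in this post.
log_excerpt = """\
                      Number of input reads |   50873627
               Uniquely mapped reads number |   148635
    Number of reads mapped to multiple loci |   19370
        Number of reads unmapped: too short |   50691944
            Number of reads unmapped: other |   13601
"""

stats = parse_star_log(log_excerpt)
total = int(stats["Number of input reads"])
for name in (
    "Uniquely mapped reads number",
    "Number of reads mapped to multiple loci",
    "Number of reads unmapped: too short",
    "Number of reads unmapped: other",
):
    n = int(stats[name])
    print(f"{name}: {n} ({100 * n / total:.2f}%)")
```

The recomputed values agree with the percentages STAR reports: nearly all reads fall into the "too short" category and only a fraction of a percent map uniquely.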

For the Solo.out:

noNoAdapter            0
noNoUMI                0
noNoCB                 0
noNinCB                0
noNinUMI               4566
noUMIhomopolymer       1216
noNoWLmatch            316720
noTooManyMM            0
noTooManyWLmatches     0
yesWLmatchExact        49288951
yesOneWLmatchWithMM    476773
yesMultWLmatchWithMM   785401
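A quick sanity check on these barcode counts is to add them up and compare the sum with the number of input reads from the mapping log. The sketch below assumes, based only on the names, that the `no...` categories are reads dropped at the barcode/UMI stage and the `yes...` categories are reads assigned to a whitelist barcode; the variable names are illustrative.

```python
# Counts copied from the Solo.out statistics quoted in this post.
barcode_stats = {
    "noNoAdapter": 0,
    "noNoUMI": 0,
    "noNoCB": 0,
    "noNinCB": 0,
    "noNinUMI": 4566,
    "noUMIhomopolymer": 1216,
    "noNoWLmatch": 316720,
    "noTooManyMM": 0,
    "noTooManyWLmatches": 0,
    "yesWLmatchExact": 49288951,
    "yesOneWLmatchWithMM": 476773,
    "yesMultWLmatchWithMM": 785401,
}
input_reads = 50873627  # "Number of input reads" from the STAR log

# Assumption: the prefix tells whether the read kept a valid barcode.
rejected = sum(n for name, n in barcode_stats.items() if name.startswith("no"))
accepted = sum(n for name, n in barcode_stats.items() if name.startswith("yes"))

print("rejected:", rejected)
print("accepted:", accepted)
print("accounts for all input reads:", rejected + accepted == input_reads)
print(f"reads with a whitelist barcode: {100 * accepted / input_reads:.2f}%")
```

Under that reading, the categories account for every input read and more than 99% of reads carry a whitelist barcode, which suggests the barcode read is being parsed as intended and that the low mapping rate comes from the other read.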

and the matrix.mtx:

%%MatrixMarket matrix coordinate integer general
%
62710 6794880 81035
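In the Matrix Market coordinate format, the first non-comment line gives the number of rows, the number of columns, and the number of stored entries. The sketch below reads those three numbers from the header quoted above; `parse_mtx_header` is a made-up helper, and treating rows as features and columns as barcodes is my reading of the STARsolo output rather than something stated in this post.

```python
def parse_mtx_header(text):
    """Return (rows, cols, entries) from a Matrix Market coordinate header."""
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("%"):
            continue  # the banner and comment lines start with '%'
        rows, cols, entries = (int(x) for x in line.split())
        return rows, cols, entries
    raise ValueError("no size line found")


# Header copied from the matrix file quoted in this post.
header = """%%MatrixMarket matrix coordinate integer general
%
62710 6794880 81035
"""

# Assumption: rows are features (genes) and columns are cell barcodes.
n_features, n_barcodes, n_entries = parse_mtx_header(header)
print("features:", n_features)
print("barcodes:", n_barcodes)
print("non-zero entries:", n_entries)
print(f"fraction of cells filled: {n_entries / (n_features * n_barcodes):.2e}")
```

With only 81035 stored entries across millions of barcode columns, the matrix is almost empty, which is consistent with the very low number of mapped reads in the log.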

#########################################################

There are almost no mapped reads. Do you have any suggestions? Thank you in advance for your help. Have a nice day.

Mayte

lopezCascales commented 1 month ago

Sorry, I didn't copy the unmapped reads:

                                UNMAPPED READS:
  Number of reads unmapped: too many mismatches |   0
       % of reads unmapped: too many mismatches |   0.00%
            Number of reads unmapped: too short |   50691944
                 % of reads unmapped: too short |   99.64%
                Number of reads unmapped: other |   13601
                     % of reads unmapped: other |   0.03%
                                CHIMERIC READS:
                       Number of chimeric reads |   0
                            % of chimeric reads |   0.00%