alexdobin / STAR

RNA-seq aligner
MIT License

Custom refgen generation (scRNA) #2234

Open radiasso opened 2 weeks ago

radiasso commented 2 weeks ago

Hi, I need to generate a reference genome from the mouse mm10 reference with 4 additional genes (4 fluorescent proteins). I already had a working reference that included them, but it was built with a previous version of STAR, so now I have to regenerate it (STAR 2.7.11b).

Following the instructions, I added the 4 genes at the end of the GTF file, tab-separated:

hrGFPIINLS      unknown exon    1       798     .       +       .       gene_id "hrGFPIINLS"; transcript_id "hrGFPIINLS"; gene_name "hrGFPIINLS"; gene_biotype protein_coding
EYFP    unknown exon    1       720     .       +       .       gene_id "EYFP"; transcript_id "EYFP"; gene_name "EYFP"; gene_biotype protein_coding
tdimer2 unknown exon    1       1395    .       +       .       gene_id "tdimer2"; transcript_id "tdimer2"; gene_name "tdimer2"; gene_biotype protein_coding
MbmCerulean     unknown exon    1       825     .       +       .       gene_id "MbmCerulean"; transcript_id "MbmCerulean"; gene_name "MbmCerulean"; gene_biotype protein_coding

and added the respective sequences at the end of the FASTA file (.fa), like:

>hrGFPIINLS dna:plasmid
ATGGTGAGCAAGCAGATCCTGAAGAACACCGGCCTGCAGGAGATCATGAGCTTCAAGGTG...
>EYFP dna:plasmid
TTACTTGTACAGCTCGTCCATGCCGAGAGTGATCCCGGCGGCGGTCACGAACTCCAGCAG...
>tdimer2 dna:plasmid tdimer2
ATGGTGGCCTCCTCCGAGGACGTCATCAAAGAGTTCATGCGCTTCAAGGTGCGCATGGAG...
>MbmCerulean dna:plasmid
TTACTTGTACAGCTCGTCCATGCCGAGAGTGATCCCGGCGGCGGTCACGAACTCCAGCAG...

and generated the refgen with:

STAR --runThreadN 27 \
     --runMode genomeGenerate \
     --genomeDir ./STAR_index \
     --genomeFastaFiles Mus_musculus.GRCm38.dna.primary_assembly_FLUO.fa \
     --sjdbGTFfile genes_FLUO.gtf \
     --sjdbOverhang 100
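
For reference, the additions described above could be sanity-checked before building the index. The following is a minimal sketch, not part of STAR: the function names are hypothetical, and it assumes that the sequence name is the FASTA header text up to the first whitespace and that each added GTF record should have 9 tab-separated fields with quoted `gene_id` and `transcript_id` attributes.

```python
def fasta_names(fasta_path):
    """Collect sequence names: the header text up to the first whitespace."""
    names = set()
    with open(fasta_path) as handle:
        for line in handle:
            if line.startswith(">"):
                header = line[1:].split()
                if header:
                    names.add(header[0])
    return names


def check_gtf_record(line, known_names):
    """Return a list of problems for one GTF line (empty list if none)."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) != 9:
        return [f"expected 9 tab-separated fields, found {len(fields)}"]
    seqname, _source, _feature, start, end, _score, strand, _frame, attrs = fields
    problems = []
    if seqname not in known_names:
        problems.append(f"sequence {seqname!r} not found in FASTA")
    if not (start.isdigit() and end.isdigit() and 1 <= int(start) <= int(end)):
        problems.append(f"bad coordinates: {start!r}..{end!r}")
    if strand not in {"+", "-", "."}:
        problems.append(f"bad strand: {strand!r}")
    # Assumption: both attributes are expected in the form key "value".
    for key in ("gene_id", "transcript_id"):
        if f'{key} "' not in attrs:
            problems.append(f"no quoted {key} attribute")
    return problems


def check_gtf(gtf_path, fasta_path):
    """Map line number -> problems for every GTF record that has issues."""
    known = fasta_names(fasta_path)
    report = {}
    with open(gtf_path) as handle:
        for number, line in enumerate(handle, start=1):
            # Skip comment lines and blank lines.
            if line.startswith("#") or not line.strip():
                continue
            problems = check_gtf_record(line, known)
            if problems:
                report[number] = problems
    return report
```

With the file names used in this post, the call would be `check_gtf("genes_FLUO.gtf", "Mus_musculus.GRCm38.dna.primary_assembly_FLUO.fa")`; an empty dictionary means none of these particular checks failed, which does not rule out other problems.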

The reference is generated without errors, and so is the alignment, which I run (as always) with:

STAR --genomeDir=./STAR_index \
     --readFilesIn=R2_001.fastq.gz, R2_001.fastq.gz \
     --runThreadN=12 \
     --soloType Droplet \
     --soloCBwhitelist mylist.txt \
     --soloUMIfiltering MultiGeneUMI \
     --soloCBmatchWLtype 1MM_multi_pseudocounts \
     --soloUMIlen 12 \
     --sjdbGTFfile=genes_FLUO.gtf \
     --readFilesCommand zcat

The problem arises when I load my features, barcodes and matrix to create an AnnData object:

ValueError: Length of values (31057) does not match length of index (31053)

The difference is exactly 4, as if those 4 genes are not actually indexed, maybe?
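
One way to narrow down which file carries the 4 extra entries is to compare the dimensions declared in the matrix header with the line counts of the feature and barcode lists. This is a minimal sketch under assumptions: the helper names are hypothetical, the three files are uncompressed and sit in one directory under the names `matrix.mtx`, `features.tsv` and `barcodes.tsv`, and the matrix is in MatrixMarket coordinate format (comment lines start with `%`, followed by a `rows cols entries` line).

```python
from pathlib import Path


def mtx_shape(path):
    """Return (n_rows, n_cols, n_entries) from a MatrixMarket header."""
    with open(path) as handle:
        for line in handle:
            # Lines starting with '%' are banner or comment lines.
            if line.startswith("%"):
                continue
            n_rows, n_cols, n_entries = (int(x) for x in line.split())
            return n_rows, n_cols, n_entries
    raise ValueError(f"no size line found in {path}")


def count_lines(path):
    """Count non-empty lines in a text file."""
    with open(path) as handle:
        return sum(1 for line in handle if line.strip())


def compare_counts(solo_dir):
    """Compare matrix dimensions with the feature and barcode lists."""
    solo_dir = Path(solo_dir)
    n_rows, n_cols, _ = mtx_shape(solo_dir / "matrix.mtx")
    return {
        "matrix_rows": n_rows,
        "matrix_cols": n_cols,
        "features": count_lines(solo_dir / "features.tsv"),
        "barcodes": count_lines(solo_dir / "barcodes.tsv"),
    }
```

Calling `compare_counts(...)` on the directory that holds the three files shows whether `features` and `matrix_rows` disagree; if they do, the feature list and the matrix may come from different runs or different references.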

What am I doing wrong? Thank you so much for your help!