alexdobin / STAR

RNA-seq aligner
MIT License
1.87k stars 506 forks

Coronavirus mapping #900

Open zillurbmb51 opened 4 years ago

zillurbmb51 commented 4 years ago

Hi, previously I used this excellent tool to map human RNA-seq. Can I use STAR to map coronavirus nucleotide FASTQ files? If yes, what are the general guidelines, and where can I find the reference genome? If no, is there an appropriate tool to do this? Best, Zillur

DarioS commented 4 years ago

Basically, it's up to you to add the SARS-CoV-2 genome sequence at the genome generation step. One of the many hundreds of genomes already public is labelled as RefSeq, and I notice that many research groups are using that one. STAR's genome generation step takes multiple FASTA files, so you would pass --genomeFastaFiles hg38.fasta WHU01.fasta to make the combined genome.
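A minimal sketch of that genome generation call, assembled in Python so it can be inspected before running. The flags (--runMode genomeGenerate, --genomeDir, --genomeFastaFiles, --sjdbGTFfile, --runThreadN) are standard STAR options; the index directory name, the combined GTF file name, and the helper function itself are placeholders made up for illustration.

```python
import shlex


def build_genome_generate_cmd(genome_dir, fasta_files, gtf_file=None, threads=4):
    """Return the argument list for a STAR genome generation run."""
    cmd = [
        "STAR",
        "--runMode", "genomeGenerate",
        "--genomeDir", genome_dir,
        "--runThreadN", str(threads),
        # Several FASTA files can follow a single --genomeFastaFiles flag.
        "--genomeFastaFiles", *fasta_files,
    ]
    if gtf_file is not None:
        # Optional annotation (human GTF with the viral genes appended).
        cmd += ["--sjdbGTFfile", gtf_file]
    return cmd


cmd = build_genome_generate_cmd(
    "human_cov2_index",              # hypothetical output directory
    ["hg38.fasta", "WHU01.fasta"],   # file names from the comment above
    gtf_file="human_plus_cov2.gtf",  # hypothetical combined annotation
)
print(shlex.join(cmd))
# To actually build the index: subprocess.run(cmd, check=True)
```

Printing the command first makes it easy to check the file list before starting a build that can take a long time on a human-sized genome.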

You also need to append the five genes to the end of your transcript GTF file. I typed them manually:

NC_045512   GenBank gene    1   21554   .   +   .   gene_id "CoV2G1"; gene_type "gene"; gene_name "ORF1ab"
NC_045512   GenBank transcript  1   21554   .   +   .   gene_id "CoV2G1"; transcript_id "CoV2T1"; gene_type "gene"; gene_name "ORF1ab"
NC_045512   GenBank exon    1   21554   .   +   .   gene_id "CoV2G1"; transcript_id "CoV2T1"; gene_type "gene"; gene_name "ORF1ab"; exon_number 1
NC_045512   GenBank gene    21562   25383   .   +   .   gene_id "CoV2G2"; gene_type "gene"; gene_name "SurfaceGly"
NC_045512   GenBank transcript  21562   25383   .   +   .   gene_id "CoV2G2"; transcript_id "CoV2T2"; gene_type "gene"; gene_name "SurfaceGly"
NC_045512   GenBank exon    21562   25383   .   +   .   gene_id "CoV2G2"; transcript_id "CoV2T2"; gene_type "gene"; gene_name "SurfaceGly"; exon_number 1
NC_045512   GenBank gene    26244   26471   .   +   .   gene_id "CoV2G3"; gene_type "gene"; gene_name "Envelope"
NC_045512   GenBank transcript  26244   26471   .   +   .   gene_id "CoV2G3"; transcript_id "CoV2T3"; gene_type "gene"; gene_name "Envelope"
NC_045512   GenBank exon    26244   26471   .   +   .   gene_id "CoV2G3"; transcript_id "CoV2T3"; gene_type "gene"; gene_name "Envelope"; exon_number 1
NC_045512   GenBank gene    26522   27190   .   +   .   gene_id "CoV2G4"; gene_type "gene"; gene_name "Matrix"
NC_045512   GenBank transcript  26522   27190   .   +   .   gene_id "CoV2G4"; transcript_id "CoV2T4"; gene_type "gene"; gene_name "Matrix"
NC_045512   GenBank exon    26522   27190   .   +   .   gene_id "CoV2G4"; transcript_id "CoV2T4"; gene_type "gene"; gene_name "Matrix"; exon_number 1
NC_045512   GenBank gene    28273   29881   .   +   .   gene_id "CoV2G5"; gene_type "gene"; gene_name "Nucleocapsid"
NC_045512   GenBank transcript  28273   29881   .   +   .   gene_id "CoV2G5"; transcript_id "CoV2T5"; gene_type "gene"; gene_name "Nucleocapsid"
NC_045512   GenBank exon    28273   29881   .   +   .   gene_id "CoV2G5"; transcript_id "CoV2T5"; gene_type "gene"; gene_name "Nucleocapsid"; exon_number 1

Feel free to copy and paste this into your own GTF file. It's a different format if you use GFF3.
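Since hand-typed GTF is easy to get subtly wrong (GTF fields must be tab-separated, and attributes are written as key "value"; pairs), here is a rough sketch that generates the same fifteen records from a small table. The gene names, IDs, and coordinates are copied from the listing above and not checked against any annotation release; the function names are made up for illustration. The sequence name in the first column has to match the FASTA header of the viral genome exactly, so adjust seqname if your FASTA uses a versioned accession.

```python
GENES = [
    # (gene_name, start, end): 1-based inclusive, all on the plus strand
    ("ORF1ab", 1, 21554),
    ("SurfaceGly", 21562, 25383),
    ("Envelope", 26244, 26471),
    ("Matrix", 26522, 27190),
    ("Nucleocapsid", 28273, 29881),
]


def format_attributes(pairs):
    """Render (key, value) pairs as a GTF attribute string."""
    parts = []
    for key, value in pairs:
        # Text values are quoted; integers such as exon_number stay bare.
        text = str(value) if isinstance(value, int) else f'"{value}"'
        parts.append(f"{key} {text};")
    return " ".join(parts)


def gtf_records(genes, seqname="NC_045512", source="GenBank"):
    """Return gene, transcript and exon lines for single-exon genes."""
    lines = []
    for index, (name, start, end) in enumerate(genes, start=1):
        gene = [("gene_id", f"CoV2G{index}")]
        transcript = gene + [("transcript_id", f"CoV2T{index}")]
        shared = [("gene_type", "gene"), ("gene_name", name)]
        features = [
            ("gene", gene + shared),
            ("transcript", transcript + shared),
            ("exon", transcript + shared + [("exon_number", 1)]),
        ]
        for feature, pairs in features:
            fields = [seqname, source, feature, str(start), str(end),
                      ".", "+", ".", format_attributes(pairs)]
            lines.append("\t".join(fields))
    return lines


records = gtf_records(GENES)
print("\n".join(records))
```

The output can be appended to an existing human GTF, for example by redirecting it to a file and concatenating the two.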

alexdobin commented 4 years ago

Hi @zillurbmb51 , Dario

good suggestions from Dario. We actually want to look into mapping CoV2+human RNA-seq, to see if any parameters need to be optimized. If you know of any good public dataset for us to play with, please let us know.

Cheers Alex

Magdoll commented 4 years ago

Hi all, I have some 2.5kb SARS-CoV-2 amplicon data here: https://downloads.pacbcloud.com/public/dataset/SarsCov2-Eden-ATCC/

These won't have human RNA-seq. But those should be coming soon. -Liz

alexdobin commented 4 years ago

Hi Liz,

thanks a lot! These are from the full CoV2 genome, right? So we expect to see full sequences, not transcripts?

Cheers Alex

zillurbmb51 commented 4 years ago

Hi all, Thank you very much. I need an annotated genome where the genes are identified in the FASTA file. How can I get this? The FASTA Liz provided is almost the same as the RefSeq one, with only a single FASTA identifier (>). We already know that there are 11 genes in the genome. Is it possible to find a FASTA file where all of these genes are identified, such as:

>geneX
sequences
>geneY
sequences
and so on...

Best Regards,
Zillur Rahman
PhD Student at Bioinformatics Lab, University of Puerto Rico, Rio-Piedras


alexdobin commented 4 years ago

Hi Zillur,

the easiest way, I think, is to use the UCSC browser CoV2 hub. In the Table Browser, select group: Genes and Gene Predictions, track: NCBI Genes, and output format: sequence.

Cheers Alex
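If you already have gene coordinates, an alternative to the Table Browser is to slice the single-record genome sequence yourself. The sketch below is only an illustration under stated assumptions: coordinates are 1-based and inclusive, every gene is on the plus strand (no reverse complement is taken), and the function name and the toy sequence are made up.

```python
def gene_fasta(sequence, genes, width=60):
    """Build multi-record FASTA text, one record per (name, start, end)."""
    records = []
    for name, start, end in genes:
        # 1-based inclusive coordinates -> 0-based half-open Python slice.
        subsequence = sequence[start - 1:end]
        records.append(f">{name}")
        # Wrap the sequence at a fixed line width.
        records.extend(
            subsequence[i:i + width]
            for i in range(0, len(subsequence), width)
        )
    return "\n".join(records) + "\n"


# Toy genome and coordinates, only to show the output layout.
toy_genome = "ACGTACGTAC"
toy_genes = [("geneX", 1, 4), ("geneY", 5, 10)]
print(gene_fasta(toy_genome, toy_genes), end="")
```

For a real run, read the genome FASTA into one string (dropping the header line and newlines) and pass the same coordinates used in the GTF.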

DarioS commented 4 years ago

There's an RNA data set published in Cell. However, they used ordinary STAR and UMI-tools instead of STARsolo. Strange decision.