Open zillurbmb51 opened 4 years ago
Basically, it's up to you to add the SARS-CoV-2 genome sequence to the genome generation step. Of the many hundreds of genomes already public, one is labelled as RefSeq, and I notice that many research groups are using that one. STAR's genome generation step takes multiple FASTA files, so you would pass --genomeFastaFiles hg38.fasta WHU01.fasta
to make the combined genome.
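A minimal sketch of what the full genome generation call could look like, written as a small Python helper that only assembles and prints the command. The GTF file name, output directory, and thread count are hypothetical placeholders, not values from this thread; the STAR options themselves (--runMode genomeGenerate, --genomeDir, --genomeFastaFiles, --sjdbGTFfile, --sjdbOverhang, --runThreadN) are standard ones.

```python
import shlex


def star_genome_generate_cmd(fasta_files, gtf_file, genome_dir,
                             threads=8, sjdb_overhang=100):
    """Build the argument list for a STAR genome generation run."""
    return [
        "STAR",
        "--runMode", "genomeGenerate",
        "--runThreadN", str(threads),
        "--genomeDir", genome_dir,
        # Several FASTA files may follow a single --genomeFastaFiles option.
        "--genomeFastaFiles", *fasta_files,
        "--sjdbGTFfile", gtf_file,
        "--sjdbOverhang", str(sjdb_overhang),
    ]


cmd = star_genome_generate_cmd(
    fasta_files=["hg38.fasta", "WHU01.fasta"],  # human + viral FASTA from the comment above
    gtf_file="hg38_plus_cov2.gtf",              # hypothetical: human GTF with the viral rows appended
    genome_dir="star_hg38_cov2",                # hypothetical output directory
)

# Print the command for inspection; to run it, pass `cmd` to
# subprocess.run(cmd, check=True) on a machine where STAR is installed.
print(shlex.join(cmd))
```

The --sjdbOverhang value of 100 is STAR's default; read length minus 1 is the usual recommendation if your reads are a different length.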
You also need to append the five genes to the end of your transcript GTF file. I typed them in manually:
NC_045512 GenBank gene 1 21554 . + . gene_id "CoV2G1"; gene_type "gene"; gene_name "ORF1ab"
NC_045512 GenBank transcript 1 21554 . + . gene_id "CoV2G1"; transcript_id "CoV2T1"; gene_type "gene"; gene_name "ORF1ab"
NC_045512 GenBank exon 1 21554 . + . gene_id "CoV2G1"; transcript_id "CoV2T1"; gene_type "gene"; gene_name "ORF1ab"; exon_number 1
NC_045512 GenBank gene 21562 25383 . + . gene_id "CoV2G2"; gene_type "gene"; gene_name "SurfaceGly"
NC_045512 GenBank transcript 21562 25383 . + . gene_id "CoV2G2"; transcript_id "CoV2T2"; gene_type "gene"; gene_name "SurfaceGly"
NC_045512 GenBank exon 21562 25383 . + . gene_id "CoV2G2"; transcript_id "CoV2T2"; gene_type "gene"; gene_name "SurfaceGly"; exon_number 1
NC_045512 GenBank gene 26244 26471 . + . gene_id "CoV2G3"; gene_type "gene"; gene_name "Envelope"
NC_045512 GenBank transcript 26244 26471 . + . gene_id "CoV2G3"; transcript_id "CoV2T3"; gene_type "gene"; gene_name "Envelope"
NC_045512 GenBank exon 26244 26471 . + . gene_id "CoV2G3"; transcript_id "CoV2T3"; gene_type "gene"; gene_name "Envelope"; exon_number 1
NC_045512 GenBank gene 26522 27190 . + . gene_id "CoV2G4"; gene_type "gene"; gene_name "Matrix"
NC_045512 GenBank transcript 26522 27190 . + . gene_id "CoV2G4"; transcript_id "CoV2T4"; gene_type "gene"; gene_name "Matrix"
NC_045512 GenBank exon 26522 27190 . + . gene_id "CoV2G4"; transcript_id "CoV2T4"; gene_type "gene"; gene_name "Matrix"; exon_number 1
NC_045512 GenBank gene 28273 29881 . + . gene_id "CoV2G5"; gene_type "gene"; gene_name "Nucleocapsid"
NC_045512 GenBank transcript 28273 29881 . + . gene_id "CoV2G5"; transcript_id "CoV2T5"; gene_type "gene"; gene_name "Nucleocapsid"
NC_045512 GenBank exon 28273 29881 . + . gene_id "CoV2G5"; transcript_id "CoV2T5"; gene_type "gene"; gene_name "Nucleocapsid"; exon_number 1
Feel free to copy and paste this into your own GTF file. It's a different format if you use GFF3.
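Since GTF is a tab-delimited format with nine columns, and tabs tend to get lost when rows are copied from a web page, here is a minimal sketch that regenerates the same fifteen rows with explicit tabs. The coordinates, IDs, and gene names are taken from the rows above; the function names are made up for illustration. Column 1 has to match the sequence name in your FASTA header, so check whether yours carries a version suffix such as NC_045512.2.

```python
# (gene_name, start, end) as listed in the GTF rows above; all on the + strand.
GENES = [
    ("ORF1ab", 1, 21554),
    ("SurfaceGly", 21562, 25383),
    ("Envelope", 26244, 26471),
    ("Matrix", 26522, 27190),
    ("Nucleocapsid", 28273, 29881),
]


def cov2_gtf_lines(seqname="NC_045512", source="GenBank"):
    """Return gene/transcript/exon GTF rows (tab-separated) for each gene."""
    rows = []
    for i, (name, start, end) in enumerate(GENES, start=1):
        gene_attrs = f'gene_id "CoV2G{i}"; gene_type "gene"; gene_name "{name}";'
        tx_attrs = (f'gene_id "CoV2G{i}"; transcript_id "CoV2T{i}"; '
                    f'gene_type "gene"; gene_name "{name}";')
        exon_attrs = tx_attrs + " exon_number 1;"
        for feature, attrs in (("gene", gene_attrs),
                               ("transcript", tx_attrs),
                               ("exon", exon_attrs)):
            fields = [seqname, source, feature, str(start), str(end),
                      ".", "+", ".", attrs]
            rows.append("\t".join(fields))
    return rows


def append_to_gtf(gtf_path, rows):
    """Append the rows to an existing GTF file (hypothetical helper)."""
    with open(gtf_path, "a") as handle:
        handle.write("\n".join(rows) + "\n")


gtf_rows = cov2_gtf_lines()
print(gtf_rows[0])
```

Calling append_to_gtf("your_annotation.gtf", gtf_rows) would then add the rows to the end of a copy of your human GTF; the file name here is only a placeholder.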
Hi @zillurbmb51, Dario,
good suggestions from Dario. We actually want to look into mapping CoV2 + human RNA-seq to see if any parameters need to be optimized. If you know of any good public datasets for us to play with, please let us know.
Cheers Alex
Hi all, I have some 2.5kb SARS-CoV-2 amplicon data here: https://downloads.pacbcloud.com/public/dataset/SarsCov2-Eden-ATCC/
These won't have human RNA-seq. But those should be coming soon. -Liz
Hi Liz,
thanks a lot! These are from the full CoV2 genome, right? So we expect to see full sequences, not transcripts?
Cheers Alex
Hi all, Thank you very much. I need an annotated genome where the genes are identified in the FASTA file. How can I get this? The FASTA Liz provided is almost the same as the RefSeq one, with only a single FASTA identifier (>). We already know that there are 11 genes in the genome. Is it possible to find a FASTA file where all these genes are identified? Such as:
>geneX
sequences
>geneY
sequences
and so on...
Best Regards
Zillur Rahman
PhD Student at Bioinformatics Lab, University of Puerto Rico, Rio-Piedras
Hi Zillur,
the easiest way, I think, is to use the UCSC Genome Browser CoV2 hub. In the Table Browser, select group: Genes and Gene Predictions, track: NCBI Genes, and output format: sequence.
Cheers Alex
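As an alternative to the Table Browser route, a per-gene FASTA can also be cut out of the single-record genome FASTA using gene coordinates from a GTF. The following is only a minimal sketch under a few assumptions: the GTF is tab-delimited, coordinates are 1-based and inclusive, all genes are on the + strand (no reverse complementing is done), and the function names are invented for this example. It is demonstrated on a toy sequence rather than the real genome.

```python
import re


def read_single_fasta(path):
    """Return (header, sequence) for the first record of a FASTA file."""
    header, chunks = None, []
    with open(path) as handle:
        for line in handle:
            line = line.strip()
            if line.startswith(">"):
                if header is not None:
                    break  # stop at the start of a second record
                header = line[1:]
            elif header is not None:
                chunks.append(line)
    return header, "".join(chunks)


def extract_genes(sequence, gtf_lines):
    """Yield (gene_name, subsequence) for every 'gene' row in the GTF lines."""
    for line in gtf_lines:
        if line.startswith("#") or not line.strip():
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 9 or fields[2] != "gene":
            continue
        start, end = int(fields[3]), int(fields[4])
        match = re.search(r'gene_name "([^"]+)"', fields[8])
        name = match.group(1) if match else f"{start}-{end}"
        # GTF is 1-based inclusive; Python slices are 0-based, end-exclusive.
        yield name, sequence[start - 1:end]


def to_fasta(records, width=60):
    """Format (name, sequence) pairs as multi-record FASTA text."""
    out = []
    for name, seq in records:
        out.append(f">{name}")
        out.extend(seq[i:i + width] for i in range(0, len(seq), width))
    return "\n".join(out)


# Toy data standing in for the viral genome and one GTF gene row.
toy_genome = "ACGTACGTAC"
toy_gtf = ['toy\tdemo\tgene\t3\t6\t.\t+\t.\tgene_id "g1"; gene_name "geneX";']
print(to_fasta(extract_genes(toy_genome, toy_gtf)))
```

With real data, the sequence would come from read_single_fasta on the genome file and the GTF lines from the open annotation file. Note that this only recovers genes that are actually present in the GTF, so the five manually typed genes above would not cover all 11.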
There's an RNA data set published in Cell. However, they used ordinary STAR and UMI-tools instead of STARsolo. Strange decision.
Hi, Previously I used this excellent tool to map human RNA-seq. Can I use STAR to map coronavirus nucleotide FASTQ files? If yes, what are the general guidelines, and where can I find the reference genome? If not, is there an appropriate tool to do this? Best, Zillur