RobertsLab / resources

https://robertslab.github.io/resources/

How to get genes from pycno genome? #1965

Open grace-ac opened 3 hours ago

grace-ac commented 3 hours ago

I want to be able to align sea star coelomocyte RNA-seq data to the pycno genome.

The files provided that I have access to are all transcripts. A wiki page listing the key files I have is here.

Is there a way to get the genes? I think @sr320 said that you did things with oyster stuff recently related to this?

sr320 commented 2 hours ago

Can you show the head of the alignment/DESeq output so we know what to match, and go ahead and add the head and a description of the genome files that are available here.

(Updated: no need for code)

kubu4 commented 2 hours ago

Also, can you explain why you need the genes if you want to align RNA-seq data to genome? You have the RNA-seq data and you have the genome. So, you should be able to do the alignments, right?

The GTF file appears to have gene annotations, so you can extract genes from that if you need their positions. Are you asking for code on how to accomplish this task?
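As a rough illustration of the GTF idea, here is a minimal Python sketch that pulls the `gene` rows out of an AUGUSTUS-style GTF. It assumes the real `augustus.hints.gtf` is tab-separated with nine columns and that `gene` rows carry the bare gene ID (e.g. `g1`) in the last column; the function name `extract_genes` and the inline example lines are hypothetical stand-ins, not taken from the repository.

```python
import io


def extract_genes(gtf_handle):
    """Yield (gene_id, seqname, start, end, strand) for each 'gene' row."""
    for line in gtf_handle:
        # Skip blank lines and comment/header lines
        if not line.strip() or line.startswith("#"):
            continue
        fields = line.rstrip("\n").split("\t")
        # Assumption: nine tab-separated columns, feature type in column 3
        if len(fields) < 9 or fields[2] != "gene":
            continue
        seqname, start, end, strand, attr = (
            fields[0], fields[3], fields[4], fields[6], fields[8]
        )
        # Assumption: AUGUSTUS gene rows hold only the gene ID in column 9
        yield attr.strip(), seqname, int(start), int(end), strand


# Stand-in for open("augustus.hints.gtf"); lines modelled on the file head
example = io.StringIO(
    "pycn_heli.0392\tAUGUSTUS\tgene\t1\t4781\t0.95\t+\t.\tg1\n"
    "pycn_heli.0392\tAUGUSTUS\ttranscript\t1\t4781\t0.95\t+\t.\tg1.t1\n"
    'pycn_heli.0392\tAUGUSTUS\tintron\t1\t4506\t0.95\t+\t.\t'
    'transcript_id "g1.t1"; gene_id "g1";\n'
)
genes = list(extract_genes(example))
print(genes)
```

On the real file, replacing the `io.StringIO` stand-in with an opened file handle should give one tuple per predicted gene, which can then be written out as a BED-like table if positions are what is needed.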

sr320 commented 2 hours ago

I think @kubu4 is on base with the GTF idea.

I think another way to frame the issue is: I have this count matrix from the HISAT/DESeq output, but I am having trouble finding the annotations associated with the IDs.

@grace-ac Is this accurate?

grace-ac commented 2 hours ago

Head of DESeq output:

log2 fold change (MLE): condition exposed vs control 
Wald test p-value: condition exposed vs control 
DataFrame with 6 rows and 6 columns
           baseMean log2FoldChange     lfcSE      stat     pvalue      padj
          <numeric>      <numeric> <numeric> <numeric>  <numeric> <numeric>
g20392.t1  0.204575       1.475366  3.114837  0.473657 0.63574420        NA
g13761.t1  1.981251      -2.209603  2.073068 -1.065861 0.28648645 0.4346174
g4199.t4  84.995694       0.861167  0.825231  1.043546 0.29669543 0.4456007
g1746.t1   6.779187       0.807499  0.809692  0.997292 0.31862293 0.4689647
g16359.t1 99.172769      -0.907016  0.317377 -2.857846 0.00426528 0.0165324
g22398.t1 71.385761      -0.551942  0.308878 -1.786922 0.07395015 0.1585745

Genome files available:

P. helianthoides genome page from NCBI: here

The downloaded genome from the dataset lives on Raven in:
/home/shared/8TB_HDD_02/graceac9/GitHub/paper-pycno-sswd-2021-2022/data/ncbi_dataset/data/GCA_032158295.1/GCA_032158295.1_ASM3215829v1_genomic.fna and the head of the file looks like this:

>CM063243.1 Pycnopodia helianthoides isolate M0D057908R chromosome 1, whole genome shotgun sequence
gttaaaataatttgaatattgGATTAGTTTCAAACCCTCCCAGATCCTTCTAGATCCTCTTTGTTGAAATacaggattca
gaaggactgagggctgtaggcccaaatgacagttgcttatcactgggtcaggcatacagTAGGGCAGatgggtgatggga

P. helianthoides genome gene predictions from Dryad: here

Scheibelhut et al. 2023, though all of these files appear to be transcripts, as each ID ends in ".t#".

augustus.hints.aa

>g3450.t1
MNYRTDDEIYEDEVDEETKLHHRIDERGVNGTDSRDEPTRVPPAKQLKIQAPVLASRLQL
EAVRRPHPPPLHPPQIHLYLEPHLEALQQRCDSYKGIVG*

augustus.hints.codingseq

>g3450.t1
ATGAACTACCGGACTGATGATGAAATATATGAAGATGAGGTGGACGAGGAGACAAAACTG
CATCACAGAATTGATGAGCGTGGAGTGAACGGAACTGATTCTCGAGATGAACCAACAAGA

augustus.hints.gtf

pycn_heli.0392  AUGUSTUS    gene    1   4781    0.95    +   .   g1
pycn_heli.0392  AUGUSTUS    transcript  1   4781    0.95    +   g1.t1
pycn_heli.0392  AUGUSTUS    intron  1   4506    0.95    +   .   transcript_id "g1.t1"; gene_id "g1";

interproscan.gff3

##gff-version 3
##feature-ontology http://song.cvs.sourceforge.net/viewvc/song/ontology/sofa.obo?revision=1.269
##interproscan-version 5.61-93.0

interproscan.tsv

g3531.t1    e585d160537c159881dfb8cf27e10150    635 CDD cd13970 ABC1_ADCK3   288 537 2.58694E-146    T   19-04-2023  IPR034646   ADCK3-like domain   -
g3531.t1    e585d160537c159881dfb8cf27e10150    635 PANTHER PTHR43851   -   6   633 8.3E-243    T   19-04-2023  -   -
g3531.t1    e585d160537c159881dfb8cf27e10150    635 MobiDBLite  mobidb-lite consensus disorder prediction   151 176 -   T   19-04-2023  -   -

@kubu4 asked: "Also, can you explain why you need the genes if you want to align RNA-seq data to genome? You have the RNA-seq data and you have the genome. So, you should be able to do the alignments, right?"

Yep! I have the alignment already - that first line in my initial comment on this issue shouldn't be there. Alignment to genome was done!

I think I want the genes to do the DEG work, because right now I think I'm doing differentially expressed transcripts.
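One possible way to get from transcript-level to gene-level counts, sketched in Python: strip the `.t#` suffix from each transcript ID and sum the raw counts of all transcripts that share a gene ID. This would need to be applied to the raw count matrix before running DESeq, not to the DESeq results table. The function names and the toy counts below are hypothetical, and whether simple summing is appropriate depends on how the counts were generated.

```python
import re

# Matches an AUGUSTUS-style transcript suffix such as ".t1" or ".t4"
TRANSCRIPT_SUFFIX = re.compile(r"\.t\d+$")


def gene_id(transcript_id):
    """Return the gene ID for a transcript ID, e.g. 'g4199.t4' -> 'g4199'."""
    return TRANSCRIPT_SUFFIX.sub("", transcript_id)


def sum_counts_by_gene(transcript_counts):
    """Collapse {transcript_id: [count per sample]} to {gene_id: [count per sample]}."""
    gene_counts = {}
    for tid, counts in transcript_counts.items():
        gid = gene_id(tid)
        if gid in gene_counts:
            # Add this transcript's counts to the running per-sample totals
            gene_counts[gid] = [a + b for a, b in zip(gene_counts[gid], counts)]
        else:
            gene_counts[gid] = list(counts)
    return gene_counts


# Toy counts for two samples (made up for illustration only)
toy_counts = {
    "g4199.t1": [10, 12],
    "g4199.t4": [5, 3],
    "g1746.t1": [7, 9],
}
gene_level = sum_counts_by_gene(toy_counts)
print(gene_level)
```

If the quantification step already writes a gene-level count matrix alongside the transcript-level one, using that file directly is probably simpler than collapsing by hand.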

sr320 commented 2 hours ago

Somebody double check, but it looks like your alignment count matrix IDs correspond to the IDs in augustus.hints.codingseq, so these annotations would work.

But another question is: is this HISAT output at the gene level or the transcript level?
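If the IDs do match, attaching annotations could look roughly like the Python sketch below, which indexes `interproscan.tsv` by its first column and looks up IDs from the DESeq table. It assumes the standard InterProScan TSV column order (protein ID, MD5, length, analysis, signature accession, signature description, ...) and tab-separated fields; `load_interpro` and the `deseq_ids` list are hypothetical names used only for illustration.

```python
import csv
import io
from collections import defaultdict


def load_interpro(tsv_handle):
    """Map each protein/transcript ID to a list of (analysis, accession, description)."""
    annotations = defaultdict(list)
    # QUOTE_NONE so stray quote characters in descriptions are kept literally
    reader = csv.reader(tsv_handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    for row in reader:
        if len(row) < 6:
            continue  # skip malformed or truncated rows
        protein_id, analysis, accession, description = row[0], row[3], row[4], row[5]
        annotations[protein_id].append((analysis, accession, description))
    return annotations


# Stand-in for open("interproscan.tsv"); rows modelled on the file head
example_tsv = io.StringIO(
    "g3531.t1\te585d160537c159881dfb8cf27e10150\t635\tCDD\tcd13970\tABC1_ADCK3"
    "\t288\t537\t2.58694E-146\tT\t19-04-2023\tIPR034646\tADCK3-like domain\t-\n"
    "g3531.t1\te585d160537c159881dfb8cf27e10150\t635\tPANTHER\tPTHR43851\t-"
    "\t6\t633\t8.3E-243\tT\t19-04-2023\t-\t-\n"
)
annotations = load_interpro(example_tsv)

# IDs as they appear in the DESeq row names (second one has no hit in this example)
deseq_ids = ["g3531.t1", "g20392.t1"]
for tid in deseq_ids:
    hits = annotations.get(tid, [])
    print(tid, hits if hits else "no InterProScan hit in this example")
```

Because the lookup key is the full `g#.t#` ID, this only works while the count matrix is at the transcript level; for gene-level results the `.t#` suffix would have to be stripped from the annotation IDs first.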
