Identification of P evermanni transcript fasta

shedurkin commented 5 months ago

I've been trying to use fastx_collapser to collapse a CDS fasta file to unique sequences only (so I can build a kallisto index with it). However, I've been getting the persitent error:

"fastx_collapser: Invalid input: This looks like a multi-line FASTA file. Line 149 contains a nucleotides string instead of a '>' prefix."

I examined the first several lines of my CDS fasta and, while it looked like it was already single-line, I ran it through fasta_format to format it as a single-line fasta just to be sure. However, I'm still getting the same error. I also looked specifically at Line 149 and it starts with a '>' before the sequence ID and the following sequence is single-line...

Does anyone have any idea why fastx_collapser doesn't want to run on my CDSfasta file? Maybe someone's had this issue before?

file linked here, can also see a screenshot of relevant section below Screenshot (227)

shedurkin commented 5 months ago

I did also try googling, and I found someone with the exact same issue, but there was no resolution

sr320 commented 5 months ago

Do not collapse- use fasta as is.

shedurkin commented 5 months ago

When I originally ran kallisto index on my un-collapsed CDS fasta file I recieved the error:

Error: repeated name in FASTA file /home/shared/8TB_HDD_02/shedurkin/deep-dive/E-Peve/data/Porites_evermanni_CDS.fasta Porites_evermani_scaffold_1:486174-486885 Run with --make-unique to replace repeated names with unique names

I was able to look at the CDS gff file and see that duplicate listing is real (screenshot below), so I assumed there could be other duplicates that would require collapsing Screenshot (228)

sr320 commented 5 months ago

This Fasta is from bedtools? Only get mRNAs

shedurkin commented 5 months ago

I see, that should fix it! Can I ask why we're using only mRNA? My understanding is that CDS are protein-coding sequences, so I was thinking that CDS sequences were kind of a proxy for mRNA, since presumably any protein coding sequence would eventually be transcribed/translated into mRNA. Is the difference just that CDS still contains untrranslated regions that we don't want?

shedurkin commented 5 months ago

I extracted only mRNA lines in the gff, used the gff and a scaffold fasta to build a fasta file of every mRNA sequence, and then tried building a kallisto index with that fasta. I was able to build an index, but I'm getting another issue now.

During indexing I was warned: [build] warning: replaced 10722052 non-ACGUT characters in the input sequence with pseudorandom nucleotides

Since the file contains 40,389 sequences, that means kallisto index is identifying an average of 265 non-ACGUT character per sequence, which seems super concerning. The fasta file is too big to look at all at once, but I copied some of the first few sequences into a google doc to try character-searching for non-ACGUT characters, and several of the first few sequences have very long stretches of just Ns in the middle of the read (example pic below). Does anyone have any idea why this is happening? Looking through the gff, some of the mRNAs also look to be much longer than I would expect -- could that be related? Screenshot (229)

kubu4 commented 5 months ago

long stretches of just Ns in the middle of the read

This is usually the nature of scaffolded assemblies, and is why the term "scaffold" is used instead of "contig."

Briefly, a "contig" is an abbreviation for "contiguous." As you might be able to guess, a "contig" is a contiguous sequence, with no gaps.

A "scaffold" is sort of the structure on the way to figuring out the entire sequence (contig). As such, it's kind of a place holder. There's sequencing data that indicates that the 5' and 3' sequences are part of the same region of DNA, however, there's not enough sequencing data to sequence the entire stretch of DNA. Thus, the two ends of the DNA fragment which were sequenced are "joined" by a stretch of N to indicate that the nucleotides between the 5' and 3' ends of the sequencing data is not known.

BTW, this info is certainly not meant to be condescending! Just am not sure how much you (or others in the lab who might encounter this issue) know about how genome assemblies are generated and represented.

shedurkin commented 5 months ago

Oh interesting, that's super helpful! So, if that means the mRNA fasta has been extracted correctly, then the Ns shouldn't be a problem for pseudoalignment, right? My basic understanding of kallisto pseudoalignment is that it avoids directly aligning reads to a reference by instead breaking reads into k-mers and then mapping them to identify which reference sequence the read likely originated from. It seems like internal stretches of Ns (or the "pseudorandom nucleotides" they're replaced with) shouldn't interefere with that process

kubu4 commented 5 months ago

Can you please share code you used (link to file in GitHub repo) to extract mRNA seqs?

shedurkin commented 5 months ago

sure, here's the current .md file

kubu4 commented 5 months ago

Kallisto indexing problem is now being addressed in https://github.com/RobertsLab/resources/issues/1795

sr320 commented 5 months ago

looking more at it I think the mRNA feature is misleading and maybe is gene.... Easiest and funnest way to proceed is to load the evermanni gff in IGV to decide how to get a fasta file of evermanni trancripts <- ultimate goal

meaning it looks like mRNA has introns which would be a problem for alignment

RobertsLab / resources

Identification of P evermanni transcript fasta #1793