Closed shedurkin closed 3 months ago
I did also try googling, and I found someone with the exact same issue, but there was no resolution
Do not collapse- use fasta as is.
When I originally ran kallisto index on my un-collapsed CDS fasta file I recieved the error:
Error: repeated name in FASTA file /home/shared/8TB_HDD_02/shedurkin/deep-dive/E-Peve/data/Porites_evermanni_CDS.fasta Porites_evermani_scaffold_1:486174-486885 Run with --make-unique to replace repeated names with unique names
I was able to look at the CDS gff file and see that duplicate listing is real (screenshot below), so I assumed there could be other duplicates that would require collapsing
This Fasta is from bedtools? Only get mRNAs
I see, that should fix it! Can I ask why we're using only mRNA? My understanding is that CDS are protein-coding sequences, so I was thinking that CDS sequences were kind of a proxy for mRNA, since presumably any protein coding sequence would eventually be transcribed/translated into mRNA. Is the difference just that CDS still contains untrranslated regions that we don't want?
I extracted only mRNA lines in the gff, used the gff and a scaffold fasta to build a fasta file of every mRNA sequence, and then tried building a kallisto index with that fasta. I was able to build an index, but I'm getting another issue now.
During indexing I was warned: [build] warning: replaced 10722052 non-ACGUT characters in the input sequence with pseudorandom nucleotides
Since the file contains 40,389 sequences, that means kallisto index is identifying an average of 265 non-ACGUT character per sequence, which seems super concerning. The fasta file is too big to look at all at once, but I copied some of the first few sequences into a google doc to try character-searching for non-ACGUT characters, and several of the first few sequences have very long stretches of just Ns in the middle of the read (example pic below). Does anyone have any idea why this is happening? Looking through the gff, some of the mRNAs also look to be much longer than I would expect -- could that be related?
long stretches of just Ns in the middle of the read
This is usually the nature of scaffolded assemblies, and is why the term "scaffold" is used instead of "contig."
Briefly, a "contig" is an abbreviation for "contiguous." As you might be able to guess, a "contig" is a contiguous sequence, with no gaps.
A "scaffold" is sort of the structure on the way to figuring out the entire sequence (contig). As such, it's kind of a place holder. There's sequencing data that indicates that the 5' and 3' sequences are part of the same region of DNA, however, there's not enough sequencing data to sequence the entire stretch of DNA. Thus, the two ends of the DNA fragment which were sequenced are "joined" by a stretch of N
to indicate that the nucleotides between the 5' and 3' ends of the sequencing data is not known.
BTW, this info is certainly not meant to be condescending! Just am not sure how much you (or others in the lab who might encounter this issue) know about how genome assemblies are generated and represented.
Oh interesting, that's super helpful! So, if that means the mRNA fasta has been extracted correctly, then the Ns shouldn't be a problem for pseudoalignment, right? My basic understanding of kallisto pseudoalignment is that it avoids directly aligning reads to a reference by instead breaking reads into k-mers and then mapping them to identify which reference sequence the read likely originated from. It seems like internal stretches of Ns (or the "pseudorandom nucleotides" they're replaced with) shouldn't interefere with that process
Can you please share code you used (link to file in GitHub repo) to extract mRNA seqs?
sure, here's the current .md file
Kallisto indexing problem is now being addressed in https://github.com/RobertsLab/resources/issues/1795
looking more at it I think the mRNA
feature is misleading and maybe is gene
.... Easiest and funnest way to proceed is to load the evermanni gff in IGV to decide how to get a fasta file of evermanni trancripts <- ultimate goal
meaning it looks like mRNA
has introns which would be a problem for alignment
I've been trying to use fastx_collapser to collapse a CDS fasta file to unique sequences only (so I can build a kallisto index with it). However, I've been getting the persitent error:
"fastx_collapser: Invalid input: This looks like a multi-line FASTA file. Line 149 contains a nucleotides string instead of a '>' prefix."
I examined the first several lines of my CDS fasta and, while it looked like it was already single-line, I ran it through fasta_format to format it as a single-line fasta just to be sure. However, I'm still getting the same error. I also looked specifically at Line 149 and it starts with a '>' before the sequence ID and the following sequence is single-line...
Does anyone have any idea why fastx_collapser doesn't want to run on my CDSfasta file? Maybe someone's had this issue before?
file linked here, can also see a screenshot of relevant section below