mis-entryof strand information in gtf for ref building

alexdobin / STAR

RNA-seq aligner

MIT License

1.87k stars 506 forks source link

mis-entryof strand information in gtf for ref building #1922

Open bioviz84 opened 1 year ago

bioviz84 commented 1 year ago

Hi ALex I have retrieved the fasta sequences from a bed file using the bedtools (without specifying the strand information), and then I treated each sequence as a gene and combined it with a human reference to create a custom reference. I generated the custom gtf file with start and end defined as the start and end of the gene since I manually attached each transcript to the referenceie (start=1 to stop =len(seq)), but I recorded all of the strand information as "+" (this is an oversight I made because I forgot to keep the strand information from the bed file). I used STAR to perform the alignment, however I wonder if I missed any alignent to the actual sequences that had "-" strand information in the bed file. Can you please tell me how the Algorithm works in this scenario. Please let me know if you need more information if it's not clear. Any help in this regard will be appreciated. bioviz

alexdobin commented 1 year ago

Hi @bioviz84

for mapping, only the splice junction coordinates are extracted from the GTF, so it should not be affected.

bioviz84 commented 1 year ago

Thank you for the reply.

Can I please ask you how can I get the TPM for a transcript in forward strand and a reverse strand from STAR output?

Another naive query, I have some ORF finding for my gene of interest , can I make a custom reference based on that, like is there any minimum length to be included in genome build that can give a confident alignment? Thanks, BioViZ

alexdobin commented 1 year ago

Calculating transcript expression requires using a transcript quantifying tool, such as RSEM or Salmon.

There is no minimum length for the reference sequence (of course they should be longer than the read length).

bioviz84 commented 1 year ago

Hi Alex,

Thanks for the reply,

I have a custom reference human+3k hERV loci ,trying to revise it by adding a couple of hERV ORFs of our interest , but those are sub-sets of already included 3kHERVs. In this case I believe there will be a sizable number of reads that map at both herv ORFs and 3k hervs . What would be the best way to get the accurate quantification in this scenario. ?

[I thought of running "--quantmode GeneCounts" with "--outFilterMultimapNmax 100", but Seeing this thread, doesn't appear to work in the way I anticipate.

What would be the ideal way to get accurate read counts mapped to these ORFs.? Thanks, Bioviz

alexdobin commented 1 year ago

Hi @bioviz84

identical sequences in the reference will result in reads multi-mapping to >1 loci, which makes quantification ambiguous. If you want to count multi-mappers, featureCounts has an option to do it.