Kingsford-Group / scallop

Scallop is a reference-based transcriptome assembler for RNA-seq
BSD 3-Clause "New" or "Revised" License

Which parameter can be used to filter the scallop result? #35

Open Huangyizhong opened 2 years ago

Huangyizhong commented 2 years ago

Hi there,

Scallop is a good tool for assembling Illumina data, and it gave me many transcripts that other software could not recover. When I used ORFfinder to predict ORFs from the Scallop results, I found many transcripts without canonical splice sites such as GT-AG, GC-AG, or AT-AC. As shown in picture 1, the Scallop results differ from the other data: many Scallop transcripts do not use canonical splice sites. Is there a parameter that can be used to filter them out? Also, as shown in picture 2, the transcript looks very strange!

Thanks so much!

Sincerely, Yizhong Huang

image

image

shaomingfu commented 2 years ago

Hi Yizhong,

Re question 1: Scallop relies entirely on the splice sites reported by the aligner. So far it does not contain any model or parameter to detect or filter out poorly supported non-canonical splice sites. We will probably add such a feature in a future release. For now, you may try to (1) check whether an aligner such as STAR or HISAT2 provides parameters to control splice sites, and/or (2) write a script of your own to filter the transcripts assembled by Scallop.
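The second suggestion (a custom filtering script) could look roughly like the Python sketch below. It is not part of Scallop. The function names (`read_exons`, `intron_motifs`, `filter_canonical`) are made up for illustration, and the sketch assumes that the genome is already loaded as a dict mapping chromosome names to sequence strings (for example, parsed from a FASTA file) and that the GTF uses the usual 1-based inclusive exon coordinates. A transcript is kept only if every intron has a GT-AG, GC-AG, or AT-AC donor/acceptor pair; single-exon transcripts have no introns and pass trivially.

```python
CANONICAL = {("GT", "AG"), ("GC", "AG"), ("AT", "AC")}
_COMPLEMENT = str.maketrans("ACGTN", "TGCAN")


def revcomp(seq):
    """Reverse-complement a DNA string."""
    return seq.translate(_COMPLEMENT)[::-1]


def read_exons(gtf_lines):
    """Group exon records: transcript_id -> (chrom, strand, [(start, end), ...])."""
    transcripts = {}
    for line in gtf_lines:
        if not line.strip() or line.startswith("#"):
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 9 or fields[2] != "exon":
            continue
        chrom, strand = fields[0], fields[6]
        start, end = int(fields[3]), int(fields[4])
        tid = fields[8].split('transcript_id "')[1].split('"')[0]
        entry = transcripts.setdefault(tid, (chrom, strand, []))
        entry[2].append((start, end))
    return transcripts


def intron_motifs(chrom_seq, strand, exons):
    """Yield (donor, acceptor) dinucleotides per intron, in transcript orientation."""
    exons = sorted(exons)
    for (_, left_end), (right_start, _) in zip(exons, exons[1:]):
        # 1-based intron is left_end+1 .. right_start-1; convert to 0-based slices.
        first2 = chrom_seq[left_end:left_end + 2].upper()
        last2 = chrom_seq[right_start - 3:right_start - 1].upper()
        if strand == "-":
            # On the minus strand the donor sits at the genomic right end.
            yield revcomp(last2), revcomp(first2)
        else:
            yield first2, last2


def is_canonical(chrom_seq, strand, exons):
    """True if all introns are canonical; unknown strand is tried both ways."""
    strands = (strand,) if strand in ("+", "-") else ("+", "-")
    return any(
        all(motif in CANONICAL for motif in intron_motifs(chrom_seq, s, exons))
        for s in strands
    )


def filter_canonical(gtf_lines, genome):
    """Return the set of transcript_ids whose introns are all canonical."""
    keep = set()
    for tid, (chrom, strand, exons) in read_exons(gtf_lines).items():
        if chrom in genome and is_canonical(genome[chrom], strand, exons):
            keep.add(tid)
    return keep
```

Writing out only the GTF lines whose `transcript_id` is in the returned set would give the filtered assembly. Transcripts on chromosomes missing from `genome` are dropped in this sketch, which may or may not be the behaviour you want.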

Re question 2: the assembled transcripts look strange to me too. Is this sample strand-specific? If so, did you specify the library type (--library_type) when running Scallop?

Best, Mingfu

Huangyizhong commented 2 years ago

@shaomingfu Thanks so much for your quick reply! It is a pity that Scallop has no parameter to filter splice sites. I checked the human annotation file with gffread, and almost all of its transcripts use canonical splice sites, so maybe I can use gffread to filter the Scallop transcripts directly. How can I choose a proper read-count threshold to filter out the undesired transcripts? As shown in picture 1, the Scallop transcript has two more bases (CT) than the other data.

The strange transcript I attached is not from a strand-specific sample. How should I deal with it? I simply ran Scallop as follows: ${scallop} -i ${bam[$PBS_ARRAYID]} -o ${output}/${NAME}_scallop.gtf

Thanks again for your kind help.

Sincerely, Yizhong Huang
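On the threshold question, one possible starting point is to filter on the coverage that the assembler reports rather than on raw read counts. The Python sketch below is only an illustration, not a Scallop feature: it assumes that each `transcript` record in the GTF carries a `cov "..."` attribute (check this against your own output), and the function names and the cutoff value are hypothetical. Looking at the distribution of the values returned by `transcript_coverages` before picking `min_cov` is probably more useful than any fixed number.

```python
import re

_COV = re.compile(r'cov "([^"]+)"')
_TID = re.compile(r'transcript_id "([^"]+)"')


def transcript_coverages(gtf_lines):
    """Map transcript_id -> coverage, read from 'transcript' records."""
    coverages = {}
    for line in gtf_lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 9 or fields[2] != "transcript":
            continue
        tid = _TID.search(fields[8])
        cov = _COV.search(fields[8])
        if tid and cov:
            coverages[tid.group(1)] = float(cov.group(1))
    return coverages


def filter_by_coverage(gtf_lines, min_cov):
    """Return the GTF lines whose transcript has coverage >= min_cov."""
    gtf_lines = list(gtf_lines)
    coverages = transcript_coverages(gtf_lines)
    kept = []
    for line in gtf_lines:
        tid = _TID.search(line)
        # Lines without a transcript_id, or without a recorded coverage, are dropped.
        if tid and coverages.get(tid.group(1), 0.0) >= min_cov:
            kept.append(line)
    return kept
```

The coverage filter and a splice-site filter are independent, so they can be applied one after the other on the same GTF.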