alexdobin / STAR

RNA-seq aligner
MIT License

How to identify the Sequencing-GAP of a gene inserted by a transposon. #1766

Open Wenwen012345 opened 1 year ago

Wenwen012345 commented 1 year ago

Dear @alexdobin

STAR is a great piece of software, and with it we obtained the results we were hoping for.

We now have a problem: a reviewer of our manuscript raised the following question: "L481-500 and Fig 6C, are there any verifications of these TE-containing long genes? Do these genes contain sequencing gaps? Gap-containing genes should be filtered out since their assembly is not complete."

The question mainly concerns the figure below. The reviewer felt that the genes we showed seemed "a bit long" and might not be biologically plausible, possibly because they contain a sequencing gap. We need to provide evidence either way.

image

For context, our pipeline was as follows: we used STAR to align the RNA-seq reads, StringTie2 to measure gene expression, and TEtranscripts to measure transposon expression. All analyses were based on GFF3 or GTF annotation files (the genome annotation was downloaded from NCBI, and the transposon annotation was generated by TEsorter). From a rough inspection, I could not see any errors in the GFF3 files, although I am not sure, since I have not been working in bioinformatics for long. The genes did not appear to be annotated as unusually long. At present I do not know how to address the reviewer's concern, and I have not found a good method for identifying sequencing gaps. Do you have any suggestions?

alexdobin commented 1 year ago

Hi Wenwen,

the only suggestion I can make is: use a different pipeline (different tools) to see if you can reproduce the predictions of the original pipeline.

Wenwen012345 commented 1 year ago

Thank you for your kind reply. We thought of another approach yesterday: extract the genomic sequences of these genes and check whether they contain a sequencing gap. According to the reviewer, this would be a problem with the DNA sequence itself. If there is a sequencing gap within a gene, it should be represented by a run of "N" characters, right?

In short, we extracted the sequences of these genes and found no sequencing gaps. Is this sufficient to show that the assembly is complete in these regions?
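
The check described above could be sketched roughly as follows. This is only a minimal illustration in plain Python, not part of STAR or any of the other tools mentioned. The function names (`read_fasta`, `find_gaps`, `gaps_in_gene`) and the toy sequence are made up for this example, and it assumes that gaps are recorded as runs of `N` (or `n`) in the genome FASTA and that gene coordinates are 1-based and inclusive, as in GFF3.

```python
import re


def read_fasta(lines):
    """Parse an iterable of FASTA lines into {sequence_id: sequence}."""
    seqs, name, chunks = {}, None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if name is not None:
                seqs[name] = "".join(chunks)
            # Use the first word of the header as the sequence ID.
            name, chunks = line[1:].split()[0], []
        else:
            chunks.append(line)
    if name is not None:
        seqs[name] = "".join(chunks)
    return seqs


def find_gaps(sequence, min_len=1):
    """Return 0-based, half-open (start, end) spans of N runs."""
    pattern = re.compile(r"[Nn]{%d,}" % min_len)
    return [(m.start(), m.end()) for m in pattern.finditer(sequence)]


def gaps_in_gene(genome, chrom, start, end, min_len=1):
    """Report N runs inside a gene.

    `start` and `end` are 1-based inclusive (GFF3 style); the returned
    spans are also 1-based inclusive genome coordinates.
    """
    region = genome[chrom][start - 1:end]
    return [(start + s, start + e - 1) for s, e in find_gaps(region, min_len)]


if __name__ == "__main__":
    # Toy genome: one sequence with a 5-base gap at positions 9-13.
    toy = [">chr1 toy example", "ACGTACGTNN", "NNNACGTACG"]
    genome = read_fasta(toy)
    print(gaps_in_gene(genome, "chr1", 1, 20))   # gene spanning the gap
    print(gaps_in_gene(genome, "chr1", 15, 20))  # gene without a gap
```

For a real genome one would pass an open file handle to `read_fasta` and take `chrom`, `start`, and `end` from the gene rows of the GFF3 file. An empty list for a gene only means that no `N` run is recorded within its coordinates in that assembly.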