gencorefacility / reform

Modify existing reference fasta and gff3/gtf files to include a new sequence

Question on the gff file of the inserted sequence #3

Closed ggstatgen closed 5 years ago

ggstatgen commented 5 years ago

Hi guys

Thanks for developing reform - sounds like an awesome tool. I have just stumbled on it and am trying to figure out if this could be what I need for an analysis I need to perform.

Essentially, I need to create a modified mouse chromosome where the exon of a gene has been replaced by a stop cassette to knock out the gene. The insertion will contain additional sequence at the 5' and 3' ends of the stop cassette. I do have the full insertion sequence in fasta.

My purpose is to obtain a 'custom' mouse chromosome which includes the above deactivated gene sequence. It seems your tool is ideal for doing this, however I'm a bit unclear on the meaning of one of the arguments you request in order for the program to run, namely --in_gff. What should this file contain in my case?

In my understanding, if my insertion sequence was, say, 3 Mb long and contained several genes, the gff would contain the absolute coordinates of the genes/exons/transcripts/TSSs/TTSs in this 3 Mb fasta sequence (where by absolute I mean the first nucleotide in the inserted sequence is at position 0).

Here's a concrete example. Let's say the novel sequence to insert contains only one gene, for example Pax6, described by the Gencode gff3 catalogue as follows

chr2  ENSEMBL  transcript       105675513  105697361  .  +  .
chr2  ENSEMBL  exon             105675513  105675649  .  +  .
chr2  ENSEMBL  exon             105675744  105675972  .  +  .
chr2  ENSEMBL  exon             105679810  105679889  .  +  .
chr2  ENSEMBL  exon             105680247  105680307  .  +  .
chr2  ENSEMBL  CDS              105680298  105680307  .  +  0
chr2  ENSEMBL  start_codon      105680298  105680300  .  +  0
chr2  ENSEMBL  exon             105683828  105683958  .  +  .
chr2  ENSEMBL  CDS              105683828  105683958  .  +  2
chr2  ENSEMBL  exon             105684751  105684792  .  +  .
chr2  ENSEMBL  CDS              105684751  105684792  .  +  0
chr2  ENSEMBL  exon             105684887  105685102  .  +  .
chr2  ENSEMBL  CDS              105684887  105685102  .  +  0
chr2  ENSEMBL  exon             105685778  105685943  .  +  .
chr2  ENSEMBL  CDS              105685778  105685943  .  +  0
chr2  ENSEMBL  exon             105691565  105691723  .  +  .
chr2  ENSEMBL  CDS              105691565  105691723  .  +  2
chr2  ENSEMBL  exon             105692194  105692276  .  +  .
chr2  ENSEMBL  CDS              105692194  105692276  .  +  2
chr2  ENSEMBL  exon             105692470  105692620  .  +  .
chr2  ENSEMBL  CDS              105692470  105692620  .  +  0
chr2  ENSEMBL  exon             105692737  105692852  .  +  .
chr2  ENSEMBL  CDS              105692737  105692852  .  +  2
chr2  ENSEMBL  exon             105695306  105695456  .  +  .
chr2  ENSEMBL  CDS              105695306  105695456  .  +  0
chr2  ENSEMBL  exon             105696270  105697361  .  +  .
chr2  ENSEMBL  CDS              105696270  105696355  .  +  2
chr2  ENSEMBL  stop_codon       105696353  105696355  .  +  0
chr2  ENSEMBL  five_prime_UTR   105675513  105675649  .  +  .
chr2  ENSEMBL  five_prime_UTR   105675744  105675972  .  +  .
chr2  ENSEMBL  five_prime_UTR   105679810  105679889  .  +  .
chr2  ENSEMBL  five_prime_UTR   105680247  105680297  .  +  .
chr2  ENSEMBL  three_prime_UTR  105696356  105697361  .  +  .

(note I'm only showing the first 8 columns of the gff for clarity here).

Given the above, how would I go about creating a suitable gff input file for reform? Would I need to use a tool (e.g. MAKER) to annotate the exons/UTRs in my novel fasta and pass the resulting gff to reform? Or something else entirely? Apologies if I'm missing something obvious.

mohammedkhalfan commented 5 years ago

Hello,

You would simply need to adjust the coordinates of the gff above so that they are relative to the insertion sequence.

For example, if the transcript defined above started at position 1 (the first nucleotide) of your inserted sequence, the first 2 gff lines would look like this:

chr2  ENSEMBL  transcript       1          21849      .  +  .
chr2  ENSEMBL  exon             1          137        .  +  .
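
The same adjustment can be applied to every line with a short script. The following is only a minimal sketch, not part of reform: the names `shift_gff_line`, `rebase_gff` and `insert_start` are hypothetical, and it assumes a tab-separated gff in which columns 4 and 5 hold 1-based start and end positions. `insert_start` is the genomic coordinate that corresponds to the first nucleotide of your inserted fasta sequence.

```python
def shift_gff_line(line, offset):
    """Subtract `offset` from the start/end columns (4 and 5) of one gff line."""
    line = line.rstrip("\n")
    fields = line.split("\t")
    # Leave comments, directives and malformed/blank lines untouched.
    if line.startswith("#") or len(fields) < 8:
        return line
    start = int(fields[3]) - offset
    end = int(fields[4]) - offset
    if start < 1:
        raise ValueError(f"feature starts before the inserted sequence: {line}")
    fields[3], fields[4] = str(start), str(end)
    return "\t".join(fields)


def rebase_gff(lines, insert_start):
    """Make coordinates relative to the inserted sequence (1-based).

    `insert_start` is the original coordinate that maps to position 1
    of the inserted fasta (an assumption you must supply yourself).
    """
    offset = insert_start - 1
    return [shift_gff_line(line, offset) for line in lines]


# First two lines of the example above, tab-separated.
example = [
    "chr2\tENSEMBL\ttranscript\t105675513\t105697361\t.\t+\t.",
    "chr2\tENSEMBL\texon\t105675513\t105675649\t.\t+\t.",
]

for row in rebase_gff(example, insert_start=105675513):
    print(row)  # transcript becomes 1..21849, exon becomes 1..137
```

If the transcript does not begin at the first nucleotide of your insertion (for instance because of flanking sequence on the 5' side), choose `insert_start` accordingly so that the features land at the right positions within the inserted fasta.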