Bulabula45 / Readon

5 stars 1 forks source link

For any custom genomes #2

Open Zhuxitong opened 6 days ago

Zhuxitong commented 6 days ago

Hi,

Is there a pipeline to obtain required input annotation files for any other genomes? For example, I have rice genome sequences in fasta and annotations in gff format, how to prepare the required files similar to those files under hg38REF directory.

We did not download those files from Emsembl so I think the script preprocess_for_other_organisms.ipynb can not be easily used.

Thank you so much!

Bulabula45 commented 2 days ago

Hi, you can prepare the cDNA file like this,


from Bio import SeqIO

records = SeqIO.parse("yourGenome.fa", "fasta")
chr2seq = {}
for record in records:
    id = str(record.id)
    chr2seq[record.id] = str(record.seq)

REVERSE_COMPLEMENT = {'A': 'T', 'T': 'A', 'C': 'G', 'G': 'C', 'N': 'N'}
def extract_seq(chr, start, end):
    seq = chr2seq[chr][start:end]
    return seq

with open("Organism.cdna.fa", 'w') as w:
    with open("your.gff") as f:
        for line in f:
            if line.startswith("#"):
                continue
            fields = line.strip().split("\t")
            if fields[2] == "transcript":
                chr = fields[0]
                tstart = int(fields[3]) - 1
                tend = int(fields[4])
                strand = fields[6]
                transcript_id = "up to your format";gene_id = "up to your format"
                line = f.readline()
                fields = line.strip().split("\t")
                exon_seqs = ""
                while line.startswith("chr") and fields[2] == "exon":
                    estart = int(fields[3]) - 1;eend = int(fields[4])
                    exon_seqs += extract_seq(chr, estart, eend)
                    line = f.readline()
                    fields = line.strip().split("\t")

                exon_seqs = exon_seqs.upper()
                exon_seqs = exon_seqs if strand == "+" else "".join([REVERSE_COMPLEMENT[base] for base in exon_seqs[::-1]])
                if exon_seqs:
                    w.write(f">{transcript_id}|{gene_id}|{str(tstart)}|{str(tend)}|{chr}|{strand}\n{exon_seqs}\n")