breister2 / Clostridium_scindens_mining


Download GFF files of the 120 UHGG C. scindens genomes #4

Open breister2 opened 1 year ago

breister2 commented 1 year ago

The GFF files were downloaded with curl from the FTP links provided.
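The exact curl invocation is not recorded here. As a rough sketch only, the download could be scripted from Python by calling curl once per link. The function names, the idea of a links file with one URL per line, and the output directory are all assumptions made for illustration.

```python
import os
import subprocess
from urllib.parse import urlparse


def build_curl_command(url, out_dir="."):
    """Build the curl argument list that saves `url` under its own file name."""
    # Hypothetical helper: the file name is taken from the last path component.
    file_name = os.path.basename(urlparse(url).path)
    return ["curl", "--silent", "--show-error",
            "--output", os.path.join(out_dir, file_name), url]


def download_all(links_file, out_dir="."):
    """Run curl once for every non-empty line of `links_file` (assumed: one URL per line)."""
    os.makedirs(out_dir, exist_ok=True)
    with open(links_file, "r") as links:
        for line in links:
            url = line.strip()
            if url:
                # check=True raises CalledProcessError if curl exits with an error.
                subprocess.run(build_curl_command(url, out_dir), check=True)
```

Calling download_all on a text file of FTP links would place one GFF file per link in the chosen directory.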

The downloaded GFF files were then converted to FASTA files using the script /storage1/data19/Scripts/python_scripts/Convert_GFF_to_Fasta.py:

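import os
import sys

input_gff_file = sys.argv[1]
# Swap only the file extension; str.replace() would also change ".gff"
# elsewhere in the path, and would leave the name unchanged (overwriting
# the input) if the file did not contain ".gff" at all.
output_fasta_file = os.path.splitext(input_gff_file)[0] + ".fasta"
in_fasta = False  # set to True once the ##FASTA directive has been seen

with open(input_gff_file, "r") as gff, open(output_fasta_file, "w") as fasta:
    for line in gff:
        if in_fasta:
            # Everything after the ##FASTA directive is sequence data.
            fasta.write(line)
        elif line.startswith("##FASTA"):
            in_fasta = True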

To check the resulting FASTA files, a custom Python script was used to count the number of scaffolds and the number of nucleotides in each file.
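The counting script itself is not included in this issue. A minimal sketch of such a check, assuming each scaffold is introduced by one ">" header line and that every other non-blank line is sequence, might look like this (count_fasta is a hypothetical name):

```python
def count_fasta(path):
    """Return (number of scaffolds, total nucleotides) for one FASTA file."""
    scaffolds = 0
    nucleotides = 0
    with open(path, "r") as fasta:
        for line in fasta:
            line = line.strip()
            if not line:
                continue  # skip blank lines
            if line.startswith(">"):
                scaffolds += 1  # each header line starts a new scaffold
            else:
                nucleotides += len(line)  # sequence line: count every character
    return scaffolds, nucleotides
```

The two numbers returned for each file can then be compared with the contig count and genome length listed in the project metadata.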

breister2 commented 1 year ago

The following genomes obtained from GFF files seemed to be incomplete when compared to the project metadata: