ShiuLab / PseudogenePipeline

Creative Commons Zero v1.0 Universal
36 stars 18 forks source link

Formatting files #4

Open jcerca opened 3 years ago

jcerca commented 3 years ago

Hi,

I was really impressed with the pipeline you developed and would like to use it for my own work. I managed to install the pipeline and run the test, however with my own data it has proven itself difficult. Here is what I did:

cp ../../08_subgenomeGFF/annotation.subgenomeA.gff3 . 
cp ../../../03_circosplots/02_Subgenomes_A_vs_B/00_data/scalesia_subgenomeA.fasta .
cp ../../../05_Orthofinder/01_sub_genomes/SCSGA.scalesia_subgenomeA.faa .
sed -i "s/_HRSCAF=[0-9]\+//" *faa  #### cleaned up the header

makeblastdb -in scalesia_subgenomeA.fasta -dbtype nucl -out blastdb #### made DB from genome

# tblastn
tblastn -num_threads 30 \
-db blastdb \
-query SCSGA.scalesia_subgenomeA.faa \
-outfmt 7 \
-seg yes > blast_output.txt

## cleaning up files (making them similar to the test_files)
# Removing comment lines
grep -v "^#"  blast_output.txt > blast_output_cleaned.txt
grep -v "^#"  annotation.subgenomeA.gff3 > annotation.subgenomeA_cleaned.gff3

# Removing contig lines.
grep -v "contig" *cleaned.gff3 > annotation.subgenomeA_cleaned_cleaned.gff3

#Changing - to _
sed -i "s/\-/_/g" blast_output_cleaned.txt
sed -i  "s/\-/_/g; s/\ .*//"  SCSGA.scalesia_subgenomeA.faa
sed -i  "s/\-/_/g" annotation.subgenomeA_cleaned_cleaned.gff3

When running the script I get the following error:

python ../../../../local_bin/PseudogenePipeline/_wrapper_scripts/CombinedPseudoWrapper.py subgenome_parameter_file
Begin Execution  [0:00:00]
Check parameter file...  [0:00:00]
Determining intron length [0:00:13]
No length given. Calculating intron length distribution
No introns present in the GFF file, calculating intron size using gene and CDS information
Traceback (most recent call last):
  File "../../../../local_bin/PseudogenePipeline/_pipeline_scripts/Seq_removeGFFaltSplicing.py", line 20, in <module>
    isLongest = bool(int(lineLst[8].split("longest=")[1].split(";")[0]))
IndexError: list index out of range
Traceback (most recent call last):
  File "../../../../local_bin/PseudogenePipeline/_wrapper_scripts/CombinedPseudoWrapper.py", line 543, in <module>
    main()
  File "../../../../local_bin/PseudogenePipeline/_wrapper_scripts/CombinedPseudoWrapper.py", line 291, in main
    P["il_t"] = int(math.ceil(percentile(intron_sizes,.95)))
TypeError: a float is required

Given that my gff does not have "longest=", (since I already selected the longest isoform (assuming that is what is needed here). I did:

# The test gff has only the following features:
awk '{print $3}' test_gff | sort -u
CDS
five_prime_UTR
gene
mRNA
three_prime_UTR

# Ours has many more. I'll be cleaning these.
awk '{print $3}' annotation.subgenomeA.gff3 | sort -u
CDS
contig
exon
expressed_sequence_match
five_prime_UTR
gene
mRNA
match
match_part
protein_match
three_prime_UTR

## Selecting the fiels as in the Arabidopsis example
awk '$3 == "CDS" || $3 == "gene" || $3 == "five_prime_UTR" || $3 == "mRNA" || $3 == "three_prime_UTR" ' annotation.subgenomeA_cleaned_cleaned.gff3 > annotation.subgenomeA_cleaned_cleaned_cleaned.gff3

## Renaming gff3 to gff - in case that's the problem
mv annotation.subgenomeA_cleaned_cleaned_cleaned.gff3 annotation.subgenomeA_cleaned_cleaned_cleaned.gff

## Here I get the lines where column 3 is "mRNA" and substitute "Parent", with ";longest=1;Parent" - to try this. Then I translate white spaces to tabs.
awk '$3=="mRNA"{gsub("Parent","longest=1;Parent",$9)}1' annotation.subgenomeA_cleaned_cleaned_cleaned.gff | tr " " "\t" > annotation.subgenomeA_cleaned_cleaned_cleaned_cleaned.gff

## Just renaming the aa and genome file so they have the same ending (.fa) as the test dataset
mv scalesia_subgenomeA.fasta scalesia_subgenomeA.fa
mv SCSGA.scalesia_subgenomeA.faa SCSGA.scalesia_subgenomeA.fa

This got me the following error: the gff.noAlt is not generated (which I guess it'd be expected and OK)

python ../../../../local_bin/PseudogenePipeline/_wrapper_scripts/CombinedPseudoWrapper.py subgenome_parameter_file
Begin Execution  [0:00:00]
Check parameter file...  [0:00:00]
Determining intron length [0:00:13]
No length given. Calculating intron length distribution
No introns present in the GFF file, calculating intron size using gene and CDS information
Traceback (most recent call last):
  File "../../../../local_bin/PseudogenePipeline/_pipeline_scripts/Seq_removeGFFaltSplicing.py", line 21, in <module>
    pacID = lineLst[8].split("pacid=")[1].split(";")[0]
IndexError: list index out of range
Traceback (most recent call last):
  File "../../../../local_bin/PseudogenePipeline/_wrapper_scripts/CombinedPseudoWrapper.py", line 543, in <module>
    main()
  File "../../../../local_bin/PseudogenePipeline/_wrapper_scripts/CombinedPseudoWrapper.py", line 291, in main
    P["il_t"] = int(math.ceil(percentile(intron_sizes,.95)))
TypeError: a float is required

Thank you for your time. Here's the parameter's file btw:

cat subgenome_parameter_file
b_out=blast_output_cleaned.txt
p_seq=SCSGA.scalesia_subgenomeA.fa
g_seq=scalesia_subgenomeA.fa
b_filter=y
b_data=tabular
p_codes=../../../../local_bin/PseudogenePipeline/_pipeline_scripts
o_codes=../../../../local_bin/PseudogenePipeline/_pipeline_scripts
f_dir=/data/bigexpansion/jcerca/local_bin/fasta-36.3.8h/bin
f_prog=tfasty36
blosum=../../../../local_bin/PseudogenePipeline/_example_files/blosum50.matrix
gff=annotation.subgenomeA_cleaned_cleaned_cleaned_cleaned.gff
repCut=300
repDiv=30

And the files:

head annotation.subgenomeA_cleaned_cleaned_cleaned_cleaned.gff
ScDrF4C_1633    maker   gene    65920   69874   .       +       .       ID=maker_ScDrF4C_1633_snap_gene_0.0;Name=maker_ScDrF4C_1633_snap_gene_0.0
ScDrF4C_1633    maker   mRNA    65920   69874   .       +       .       ID=maker_ScDrF4C_1633_snap_gene_0.0_mRNA_1;longest=1;Parent=maker_ScDrF4C_1633_snap_gene_0.0;Name=maker_ScDrF4C_1633_snap_gene_0.0_mRNA_1;_AED=0.00;_eAED=0.00;_QI=279|1|1|1|0.8|0.66|6|174|188
ScDrF4C_1633    maker   five_prime_UTR  65920   66198   .       +       .       ID=maker_ScDrF4C_1633_snap_gene_0.0_mRNA_1:five_prime_utr;Parent=maker_ScDrF4C_1633_snap_gene_0.0_mRNA_1
ScDrF4C_1633    maker   CDS     66199   66323   .       +       0       ID=maker_ScDrF4C_1633_snap_gene_0.0_mRNA_1:cds;Parent=maker_ScDrF4C_1633_snap_gene_0.0_mRNA_1
ScDrF4C_1633    maker   CDS     66444   66471   .       +       1       ID=maker_ScDrF4C_1633_snap_gene_0.0_mRNA_1:cds;Parent=maker_ScDrF4C_1633_snap_gene_0.0_mRNA_1
ScDrF4C_1633    maker   CDS     68284   68511   .       +       0       ID=maker_ScDrF4C_1633_snap_gene_0.0_mRNA_1:cds;Parent=maker_ScDrF4C_1633_snap_gene_0.0_mRNA_1
ScDrF4C_1633    maker   CDS     68593   68656   .       +       0       ID=maker_ScDrF4C_1633_snap_gene_0.0_mRNA_1:cds;Parent=maker_ScDrF4C_1633_snap_gene_0.0_mRNA_1
ScDrF4C_1633    maker   CDS     68755   68814   .       +       2       ID=maker_ScDrF4C_1633_snap_gene_0.0_mRNA_1:cds;Parent=maker_ScDrF4C_1633_snap_gene_0.0_mRNA_1
ScDrF4C_1633    maker   CDS     69639   69700   .       +       2       ID=maker_ScDrF4C_1633_snap_gene_0.0_mRNA_1:cds;Parent=maker_ScDrF4C_1633_snap_gene_0.0_mRNA_1
ScDrF4C_1633    maker   three_prime_UTR 69701   69874   .       +       .       ID=maker_ScDrF4C_1633_snap_gene_0.0_mRNA_1:three_prime_utr;Parent=maker_ScDrF4C_1633_snap_gene_0.0_mRNA_1

head SCSGA.scalesia_subgenomeA.fa
 >maker_ScDrF4C_1633_snap_gene_0.0_mRNA_1
MEVRQEATMDNLSPEQQPPLPSPSKQPKPSPVPVDTNSVTQRLQKELMALMMSGDLGVSAFPKGESIFTWIGTIEGGKGTMYEGLSYKLSLQFPLDYPFKPPQVKFETMCFHPNIDQFGNICLDILQDKWSAAYDCRTILLSIQSLLGEPNIESPLNSSAASLWSNKEDYRKMVHKHYAAEGTCRELI
>maker_ScDrF4C_1633_augustus_gene_0.12_mRNA_1
MVAAHLLQHTPTWAVATVCFVFICLGLLIEHLFDIVSHWLKEHRKTSMYEAIEKLKSALMLLGLMSLILVSTQRTIAKICIPNEVANSMLPCHRIYEETMQTTEAYELHYATPARFEPLALEDNMVRAKHRRLDLVNMSATSNYSTGHCASKGMTSFISEDGINQLNTFICVLAIMQIVYSVVTVALGRAKMRSWKAWEQETQTLEYMVANGILYNQII
>maker_ScDrF4C_1633_snap_gene_2.6_mRNA_1
MLKSQINQSHSLHNLLPNRKAFVRGDTTNHSSSNAYSPANLRQHSSTKKSNSTRASSTSTVGNIKAISIPFLTKASTVKCVITVQPTISSALAGISVGGVVDGVSDLLGRSFLLELVSNDLDSKGNQKTVKAYASYEALDFNINVYTYKCDFDVPEDFGEIGAVLVENEYSKKMFFKNIVLNNGVTCTCESWVHSKYDNPEKRIFFTDKSYLPSETPTALKSLREKDMESLRGNGEGEPPDQSLDLARPVLGGEAHPYPRRCRTGREMSSKDPLTESRTLLPFYVPADEDFSEIKSVNFGAKTLYSVLHAVVPILDSIVTDKDKGFPLFTSIDLLYNEGVNVPPPDSGILTALPRLVKGATDAANTVIKFETPETIDRDTFSWFRDEEFCRQMLAGLNPCSIQLVTEWPLMSKLDPEIYGPAESAITKEIVEEEIKGFMTLEEALAQKKLFMLDYHDLLLPYVNKTRELKGISLYGSRTLMFLTPAGTLRPLAIELTRPPSDGKSQWRHVYTPAWDATGAWLWKLAKAHVLAHDSTYHQLVSHWLRTHCVTEPYIIATNRQLSQMHPIRRLLLPHFRYTMQINSLARLLLINAGSTFSPGKYCMQICSDAYDQLWRFDQEALPADLISRGMAVEDSTAPHGIKLTIEDYPFANDGLLIYDAIKQWATSYVNHYYPQANLVESDEELQAWWNEIRTVGHGDKKDEPWWPQLKTQDDLIEIVSTIMWVPSGHHSAVNFGQYDFGGYFPNRPTISRTKMPNEDPTDEEWQSFIKRPEDALLKCFPSQIQATQVMSVLDVLSSHSPEEEYIGGNIEAAWEVEPVIKAAFEEFRERLNELEAIIDSRNMDPNFKNRSGAGLVPYQLLKPYSEKGVTGRGVPNSISI
>augustus_masked_ScDrF4C_1633_processed_gene_2.14_mRNA_1
MAWKLAKAHVLAHDSGYHQLVSHWLRTHCVTEPYIIATNRQLSQMHPIRRLLLPHFRYTMQINALARLLLINAEGIIESTFSPGKYCMQLCSDAYDQLWRFDHETLPSDLISRGMGVEDPTAPHGVKLTIEDYPFANDGLLIFDAIKKWATSYVNHYYPQANLVESDEELQAWWNKIRTVGHGDNKDEPWWPQLKTQDDLIGIVSTIMWVTSGHHLAVNFGQYDFAGYFPNRPTISRTKMPNEDPTDEEWQSFLKRPEDALLNCFPSQIQSTKVMSVLDVLSSHSPEEEYIGGITEAAWEAEPAIKVAFEEFHGRLNELEEIIDSRNTNPNLKNRSGAGLVPYQLLKPYSEKG
>maker_ScDrF4C_1633_augustus_gene_3.8_mRNA_1
MRGVITVLPTITDALAQVTVGIIGNVADSVTDLLGKSFFVELVSDDLDSRGKEKKTVNSYASCDGLDDESKTYKYEFEIEVPEGFGDIGAILVENIRHKEAYIKDIVLDDGNVIFPCESWIHSKHDNPEKRIFFTDKSYLPSDTPVGLKSLREKDLESLRGNGEGERKPFERIYDYDVYNDIGDPESNSDLARPVLGGKTHPYPRRCRTGREMTSIEPWSESRTLLPFYVPRDEDFSEIKAVTFGATALYSVLHGVIPTLESNLTDPNKGFSSFQDIGSLYNEGVDMLPSDHNGLLSAIPRLVKSIASTKKSVLQFETPRMNDRDSFSWFRDEEFCRQTLAGLNPYSIQLVTEWPLMSKLDPKVYGPAESAITKEIVEQEIKGFMTLEEALEQKKLFLLDYHDLLIPYVNKVRELEGTTLYGSRTLMFLTSTGTLRPLAIELTRPPNNGKPQWKHVYTPCWDATGAWLWKLAKAHVLAHDSGHHQLISHWLRTHCVTEPYIIAANRHLSKMHPIQRLLCPHFRYTMEINGLARLALINAGGTIETSFSLLKYSMQFSSDVYAQQWRFDHEALPADLINRGMAVEDESAPHGIKLTIQDYPFANDGLLLWDAITQWATTYINHYYPQPNLVQSDTELQAWWTEIQTVGHGDKKDEPWWPQLKTQQDLIKIVTTIMWVSSGHHSAVNFGQYDFAGYFPNRPTIARTKMPNEDPTVEEWEAFINKPEDVLLDSFPTQIQATKVMSVLDILSNHSPDEEYIGSTMEPAWAADPAIKAAFEEFSGGLKELEGIIDLRNNDPNLRNRSGAGLLPYELFKPFSEPGVTGKGVPYSISI

grep ">" scalesia_subgenomeA.fa
scalesia_subgenomeA.fa:>ScDrF4C_6
scalesia_subgenomeA.fa:>ScDrF4C_1633
scalesia_subgenomeA.fa:>ScDrF4C_1634
scalesia_subgenomeA.fa:>ScDrF4C_9
scalesia_subgenomeA.fa:>ScDrF4C_4
scalesia_subgenomeA.fa:>ScDrF4C_25
scalesia_subgenomeA.fa:>ScDrF4C_21
scalesia_subgenomeA.fa:>ScDrF4C_19
scalesia_subgenomeA.fa:>ScDrF4C_10
scalesia_subgenomeA.fa:>ScDrF4C_17
scalesia_subgenomeA.fa:>ScDrF4C_14
scalesia_subgenomeA.fa:>ScDrF4C_24
scalesia_subgenomeA.fa:>ScDrF4C_30
scalesia_subgenomeA.fa:>ScDrF4C_5
scalesia_subgenomeA.fa:>ScDrF4C_18
scalesia_subgenomeA.fa:>ScDrF4C_15
scalesia_subgenomeA.fa:>ScDrF4C_16
bio15anu commented 3 years ago

I'm not sure if you ever solved this issue, but for anybody who runs into the same thing you have to be aware that the Seq_removeGFFaltSplicing.py script is looking for both a "longest=" field and a "pacid=" field in column 9 of your GFF file. As far as I can tell this PAC refers to a "phytozome accession ID" and the purpose here is simply a way of grouping each feature within each alternative isoform to the same mRNA entry. For each feature (mRNA, exon, CDS, UTRs) the pacid can be replaced with the "ID" of the parent mRNA entry and the script should execute properly.

So to extend upon the already good solution by @jcerca this is what I did to solve the issue fully:

awk -F "\t" -v OFS="\t" '{
if($3=="mRNA"){gsub("Parent","longest=1;Parent",$9); split($9,ID,";"); split(ID[1],pac,"="); $9=$9";pacid="pac[2]};
if($3=="CDS"||$3=="exon"||$3~/UTR$/){split($9,ID,";"); split(ID[2],pac,"="); $9=$9";pacid="pac[2]}
}1' original.gff > cleaned.gff

NB: This is assuming the "ID" is the first field for each mRNA feature, and the "Parent" is the second field for each CDS, exon and UTR feature, as with the example GFF provided above.