Formatting files - Githubissues

Hi,

I was really impressed with the pipeline you developed and would like to use it for my own work. I managed to install the pipeline and run the test, however with my own data it has proven itself difficult. Here is what I did:

cp ../../08_subgenomeGFF/annotation.subgenomeA.gff3 . 
cp ../../../03_circosplots/02_Subgenomes_A_vs_B/00_data/scalesia_subgenomeA.fasta .
cp ../../../05_Orthofinder/01_sub_genomes/SCSGA.scalesia_subgenomeA.faa .
sed -i "s/_HRSCAF=[0-9]\+//" *faa  #### cleaned up the header

makeblastdb -in scalesia_subgenomeA.fasta -dbtype nucl -out blastdb #### made DB from genome

# tblastn
tblastn -num_threads 30 \
-db blastdb \
-query SCSGA.scalesia_subgenomeA.faa \
-outfmt 7 \
-seg yes > blast_output.txt

## cleaning up files (making them similar to the test_files)
# Removing comment lines
grep -v "^#"  blast_output.txt > blast_output_cleaned.txt
grep -v "^#"  annotation.subgenomeA.gff3 > annotation.subgenomeA_cleaned.gff3

# Removing contig lines.
grep -v "contig" *cleaned.gff3 > annotation.subgenomeA_cleaned_cleaned.gff3

#Changing - to _
sed -i "s/\-/_/g" blast_output_cleaned.txt
sed -i  "s/\-/_/g; s/\ .*//"  SCSGA.scalesia_subgenomeA.faa
sed -i  "s/\-/_/g" annotation.subgenomeA_cleaned_cleaned.gff3

When running the script I get the following error:

python ../../../../local_bin/PseudogenePipeline/_wrapper_scripts/CombinedPseudoWrapper.py subgenome_parameter_file
Begin Execution  [0:00:00]
Check parameter file...  [0:00:00]
Determining intron length [0:00:13]
No length given. Calculating intron length distribution
No introns present in the GFF file, calculating intron size using gene and CDS information
Traceback (most recent call last):
  File "../../../../local_bin/PseudogenePipeline/_pipeline_scripts/Seq_removeGFFaltSplicing.py", line 20, in <module>
    isLongest = bool(int(lineLst[8].split("longest=")[1].split(";")[0]))
IndexError: list index out of range
Traceback (most recent call last):
  File "../../../../local_bin/PseudogenePipeline/_wrapper_scripts/CombinedPseudoWrapper.py", line 543, in <module>
    main()
  File "../../../../local_bin/PseudogenePipeline/_wrapper_scripts/CombinedPseudoWrapper.py", line 291, in main
    P["il_t"] = int(math.ceil(percentile(intron_sizes,.95)))
TypeError: a float is required

Given that my gff does not have "longest=", (since I already selected the longest isoform (assuming that is what is needed here). I did:

# The test gff has only the following features:
awk '{print $3}' test_gff | sort -u
CDS
five_prime_UTR
gene
mRNA
three_prime_UTR

# Ours has many more. I'll be cleaning these.
awk '{print $3}' annotation.subgenomeA.gff3 | sort -u
CDS
contig
exon
expressed_sequence_match
five_prime_UTR
gene
mRNA
match
match_part
protein_match
three_prime_UTR

## Selecting the fiels as in the Arabidopsis example
awk '$3 == "CDS" || $3 == "gene" || $3 == "five_prime_UTR" || $3 == "mRNA" || $3 == "three_prime_UTR" ' annotation.subgenomeA_cleaned_cleaned.gff3 > annotation.subgenomeA_cleaned_cleaned_cleaned.gff3

## Renaming gff3 to gff - in case that's the problem
mv annotation.subgenomeA_cleaned_cleaned_cleaned.gff3 annotation.subgenomeA_cleaned_cleaned_cleaned.gff

## Here I get the lines where column 3 is "mRNA" and substitute "Parent", with ";longest=1;Parent" - to try this. Then I translate white spaces to tabs.
awk '$3=="mRNA"{gsub("Parent","longest=1;Parent",$9)}1' annotation.subgenomeA_cleaned_cleaned_cleaned.gff | tr " " "\t" > annotation.subgenomeA_cleaned_cleaned_cleaned_cleaned.gff

## Just renaming the aa and genome file so they have the same ending (.fa) as the test dataset
mv scalesia_subgenomeA.fasta scalesia_subgenomeA.fa
mv SCSGA.scalesia_subgenomeA.faa SCSGA.scalesia_subgenomeA.fa

This got me the following error: the gff.noAlt is not generated (which I guess it'd be expected and OK)

python ../../../../local_bin/PseudogenePipeline/_wrapper_scripts/CombinedPseudoWrapper.py subgenome_parameter_file
Begin Execution  [0:00:00]
Check parameter file...  [0:00:00]
Determining intron length [0:00:13]
No length given. Calculating intron length distribution
No introns present in the GFF file, calculating intron size using gene and CDS information
Traceback (most recent call last):
  File "../../../../local_bin/PseudogenePipeline/_pipeline_scripts/Seq_removeGFFaltSplicing.py", line 21, in <module>
    pacID = lineLst[8].split("pacid=")[1].split(";")[0]
IndexError: list index out of range
Traceback (most recent call last):
  File "../../../../local_bin/PseudogenePipeline/_wrapper_scripts/CombinedPseudoWrapper.py", line 543, in <module>
    main()
  File "../../../../local_bin/PseudogenePipeline/_wrapper_scripts/CombinedPseudoWrapper.py", line 291, in main
    P["il_t"] = int(math.ceil(percentile(intron_sizes,.95)))
TypeError: a float is required

Thank you for your time. Here's the parameter's file btw:

cat subgenome_parameter_file
b_out=blast_output_cleaned.txt
p_seq=SCSGA.scalesia_subgenomeA.fa
g_seq=scalesia_subgenomeA.fa
b_filter=y
b_data=tabular
p_codes=../../../../local_bin/PseudogenePipeline/_pipeline_scripts
o_codes=../../../../local_bin/PseudogenePipeline/_pipeline_scripts
f_dir=/data/bigexpansion/jcerca/local_bin/fasta-36.3.8h/bin
f_prog=tfasty36
blosum=../../../../local_bin/PseudogenePipeline/_example_files/blosum50.matrix
gff=annotation.subgenomeA_cleaned_cleaned_cleaned_cleaned.gff
repCut=300
repDiv=30

And the files:

head annotation.subgenomeA_cleaned_cleaned_cleaned_cleaned.gff
ScDrF4C_1633    maker   gene    65920   69874   .       +       .       ID=maker_ScDrF4C_1633_snap_gene_0.0;Name=maker_ScDrF4C_1633_snap_gene_0.0
ScDrF4C_1633    maker   mRNA    65920   69874   .       +       .       ID=maker_ScDrF4C_1633_snap_gene_0.0_mRNA_1;longest=1;Parent=maker_ScDrF4C_1633_snap_gene_0.0;Name=maker_ScDrF4C_1633_snap_gene_0.0_mRNA_1;_AED=0.00;_eAED=0.00;_QI=279|1|1|1|0.8|0.66|6|174|188
ScDrF4C_1633    maker   five_prime_UTR  65920   66198   .       +       .       ID=maker_ScDrF4C_1633_snap_gene_0.0_mRNA_1:five_prime_utr;Parent=maker_ScDrF4C_1633_snap_gene_0.0_mRNA_1
ScDrF4C_1633    maker   CDS     66199   66323   .       +       0       ID=maker_ScDrF4C_1633_snap_gene_0.0_mRNA_1:cds;Parent=maker_ScDrF4C_1633_snap_gene_0.0_mRNA_1
ScDrF4C_1633    maker   CDS     66444   66471   .       +       1       ID=maker_ScDrF4C_1633_snap_gene_0.0_mRNA_1:cds;Parent=maker_ScDrF4C_1633_snap_gene_0.0_mRNA_1
ScDrF4C_1633    maker   CDS     68284   68511   .       +       0       ID=maker_ScDrF4C_1633_snap_gene_0.0_mRNA_1:cds;Parent=maker_ScDrF4C_1633_snap_gene_0.0_mRNA_1
ScDrF4C_1633    maker   CDS     68593   68656   .       +       0       ID=maker_ScDrF4C_1633_snap_gene_0.0_mRNA_1:cds;Parent=maker_ScDrF4C_1633_snap_gene_0.0_mRNA_1
ScDrF4C_1633    maker   CDS     68755   68814   .       +       2       ID=maker_ScDrF4C_1633_snap_gene_0.0_mRNA_1:cds;Parent=maker_ScDrF4C_1633_snap_gene_0.0_mRNA_1
ScDrF4C_1633    maker   CDS     69639   69700   .       +       2       ID=maker_ScDrF4C_1633_snap_gene_0.0_mRNA_1:cds;Parent=maker_ScDrF4C_1633_snap_gene_0.0_mRNA_1
ScDrF4C_1633    maker   three_prime_UTR 69701   69874   .       +       .       ID=maker_ScDrF4C_1633_snap_gene_0.0_mRNA_1:three_prime_utr;Parent=maker_ScDrF4C_1633_snap_gene_0.0_mRNA_1

head SCSGA.scalesia_subgenomeA.fa
 >maker_ScDrF4C_1633_snap_gene_0.0_mRNA_1
MEVRQEATMDNLSPEQQPPLPSPSKQPKPSPVPVDTNSVTQRLQKELMALMMSGDLGVSAFPKGESIFTWIGTIEGGKGTMYEGLSYKLSLQFPLDYPFKPPQVKFETMCFHPNIDQFGNICLDILQDKWSAAYDCRTILLSIQSLLGEPNIESPLNSSAASLWSNKEDYRKMVHKHYAAEGTCRELI
>maker_ScDrF4C_1633_augustus_gene_0.12_mRNA_1
MVAAHLLQHTPTWAVATVCFVFICLGLLIEHLFDIVSHWLKEHRKTSMYEAIEKLKSALMLLGLMSLILVSTQRTIAKICIPNEVANSMLPCHRIYEETMQTTEAYELHYATPARFEPLALEDNMVRAKHRRLDLVNMSATSNYSTGHCASKGMTSFISEDGINQLNTFICVLAIMQIVYSVVTVALGRAKMRSWKAWEQETQTLEYMVANGILYNQII
>maker_ScDrF4C_1633_snap_gene_2.6_mRNA_1
MLKSQINQSHSLHNLLPNRKAFVRGDTTNHSSSNAYSPANLRQHSSTKKSNSTRASSTSTVGNIKAISIPFLTKASTVKCVITVQPTISSALAGISVGGVVDGVSDLLGRSFLLELVSNDLDSKGNQKTVKAYASYEALDFNINVYTYKCDFDVPEDFGEIGAVLVENEYSKKMFFKNIVLNNGVTCTCESWVHSKYDNPEKRIFFTDKSYLPSETPTALKSLREKDMESLRGNGEGEPPDQSLDLARPVLGGEAHPYPRRCRTGREMSSKDPLTESRTLLPFYVPADEDFSEIKSVNFGAKTLYSVLHAVVPILDSIVTDKDKGFPLFTSIDLLYNEGVNVPPPDSGILTALPRLVKGATDAANTVIKFETPETIDRDTFSWFRDEEFCRQMLAGLNPCSIQLVTEWPLMSKLDPEIYGPAESAITKEIVEEEIKGFMTLEEALAQKKLFMLDYHDLLLPYVNKTRELKGISLYGSRTLMFLTPAGTLRPLAIELTRPPSDGKSQWRHVYTPAWDATGAWLWKLAKAHVLAHDSTYHQLVSHWLRTHCVTEPYIIATNRQLSQMHPIRRLLLPHFRYTMQINSLARLLLINAGSTFSPGKYCMQICSDAYDQLWRFDQEALPADLISRGMAVEDSTAPHGIKLTIEDYPFANDGLLIYDAIKQWATSYVNHYYPQANLVESDEELQAWWNEIRTVGHGDKKDEPWWPQLKTQDDLIEIVSTIMWVPSGHHSAVNFGQYDFGGYFPNRPTISRTKMPNEDPTDEEWQSFIKRPEDALLKCFPSQIQATQVMSVLDVLSSHSPEEEYIGGNIEAAWEVEPVIKAAFEEFRERLNELEAIIDSRNMDPNFKNRSGAGLVPYQLLKPYSEKGVTGRGVPNSISI
>augustus_masked_ScDrF4C_1633_processed_gene_2.14_mRNA_1
MAWKLAKAHVLAHDSGYHQLVSHWLRTHCVTEPYIIATNRQLSQMHPIRRLLLPHFRYTMQINALARLLLINAEGIIESTFSPGKYCMQLCSDAYDQLWRFDHETLPSDLISRGMGVEDPTAPHGVKLTIEDYPFANDGLLIFDAIKKWATSYVNHYYPQANLVESDEELQAWWNKIRTVGHGDNKDEPWWPQLKTQDDLIGIVSTIMWVTSGHHLAVNFGQYDFAGYFPNRPTISRTKMPNEDPTDEEWQSFLKRPEDALLNCFPSQIQSTKVMSVLDVLSSHSPEEEYIGGITEAAWEAEPAIKVAFEEFHGRLNELEEIIDSRNTNPNLKNRSGAGLVPYQLLKPYSEKG
>maker_ScDrF4C_1633_augustus_gene_3.8_mRNA_1
MRGVITVLPTITDALAQVTVGIIGNVADSVTDLLGKSFFVELVSDDLDSRGKEKKTVNSYASCDGLDDESKTYKYEFEIEVPEGFGDIGAILVENIRHKEAYIKDIVLDDGNVIFPCESWIHSKHDNPEKRIFFTDKSYLPSDTPVGLKSLREKDLESLRGNGEGERKPFERIYDYDVYNDIGDPESNSDLARPVLGGKTHPYPRRCRTGREMTSIEPWSESRTLLPFYVPRDEDFSEIKAVTFGATALYSVLHGVIPTLESNLTDPNKGFSSFQDIGSLYNEGVDMLPSDHNGLLSAIPRLVKSIASTKKSVLQFETPRMNDRDSFSWFRDEEFCRQTLAGLNPYSIQLVTEWPLMSKLDPKVYGPAESAITKEIVEQEIKGFMTLEEALEQKKLFLLDYHDLLIPYVNKVRELEGTTLYGSRTLMFLTSTGTLRPLAIELTRPPNNGKPQWKHVYTPCWDATGAWLWKLAKAHVLAHDSGHHQLISHWLRTHCVTEPYIIAANRHLSKMHPIQRLLCPHFRYTMEINGLARLALINAGGTIETSFSLLKYSMQFSSDVYAQQWRFDHEALPADLINRGMAVEDESAPHGIKLTIQDYPFANDGLLLWDAITQWATTYINHYYPQPNLVQSDTELQAWWTEIQTVGHGDKKDEPWWPQLKTQQDLIKIVTTIMWVSSGHHSAVNFGQYDFAGYFPNRPTIARTKMPNEDPTVEEWEAFINKPEDVLLDSFPTQIQATKVMSVLDILSNHSPDEEYIGSTMEPAWAADPAIKAAFEEFSGGLKELEGIIDLRNNDPNLRNRSGAGLLPYELFKPFSEPGVTGKGVPYSISI

grep ">" scalesia_subgenomeA.fa
scalesia_subgenomeA.fa:>ScDrF4C_6
scalesia_subgenomeA.fa:>ScDrF4C_1633
scalesia_subgenomeA.fa:>ScDrF4C_1634
scalesia_subgenomeA.fa:>ScDrF4C_9
scalesia_subgenomeA.fa:>ScDrF4C_4
scalesia_subgenomeA.fa:>ScDrF4C_25
scalesia_subgenomeA.fa:>ScDrF4C_21
scalesia_subgenomeA.fa:>ScDrF4C_19
scalesia_subgenomeA.fa:>ScDrF4C_10
scalesia_subgenomeA.fa:>ScDrF4C_17
scalesia_subgenomeA.fa:>ScDrF4C_14
scalesia_subgenomeA.fa:>ScDrF4C_24
scalesia_subgenomeA.fa:>ScDrF4C_30
scalesia_subgenomeA.fa:>ScDrF4C_5
scalesia_subgenomeA.fa:>ScDrF4C_18
scalesia_subgenomeA.fa:>ScDrF4C_15
scalesia_subgenomeA.fa:>ScDrF4C_16

ShiuLab / PseudogenePipeline

Formatting files #4