ParkinsonLab / MetaPro

GNU General Public License v3.0

Blat doesn't look for correct extension #9

Closed Jeltje closed 3 years ago

Jeltje commented 3 years ago
    if mp_util.check_bypass_log(output_folder_path, GA_BLAT_label):
        marker_path_list = []
        for split_sample in os.listdir(os.path.join(GA_BWA_path, "final_results")):
            if(split_sample.endswith(".fasta")):   

... but the BWA files end with .fna

billytaj commented 3 years ago

This piece of code checks for the fastas generated after we put the BWA results through the post-processing script. The BLAT step wouldn't be looking for .fna files (those are the genes we found with the reads, using BWA). BLAT is used to scan the leftover reads that BWA didn't annotate.

I suspect there's an issue with the bypass log skipping over a step. At the end of BWA_pp, you're supposed to get 3 files per fastq chunk:
-> a gene_map.tsv
-> a mapped_genes.fna
-> a fasta of leftover reads not annotated by BWA (or reads that were annotated but didn't pass the quality control criteria)

In the latest feature, I attempted to merge the fastas together to reduce the number of files.

-> Try editing the bypass_log.txt and removing the line that says: "GA_BWA_pp" and run the pipe again
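
If hand-editing the log is inconvenient, the same step can be scripted. This is only a rough sketch: it assumes bypass_log.txt is plain text with one step label per line, which this thread doesn't confirm, and `remove_bypass_label` is a hypothetical helper, not part of MetaPro.

```python
from pathlib import Path


def remove_bypass_label(log_path, label):
    """Drop every line that exactly matches `label`; return how many were removed.

    Assumption: the bypass log holds one step label per line.
    """
    path = Path(log_path)
    lines = path.read_text().splitlines()
    # Exact match only, so removing "GA_BWA_pp" leaves other labels untouched.
    kept = [line for line in lines if line.strip() != label]
    path.write_text("".join(line + "\n" for line in kept))
    return len(lines) - len(kept)
```

Calling `remove_bypass_label("bypass_log.txt", "GA_BWA_pp")` from the output folder and then rerunning the pipe should have the same effect as the manual edit.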

Jeltje commented 3 years ago

I did, same results.

There are files in GA_BWA/data/2_bwa_pp/ (e.g. singletons_11_chocophlan_chunk_5.fasta), and there are files like these in GA_BWA/final_results:

contigs_0_chocophlan_chunk_0_gene_map.tsv
contigs_0_chocophlan_chunk_0_mapped_genes.fna

There are many scripts with names like BWA_pp_singletons_45_chocophlan_chunk_9.sh directly under GA_BWA and their contents look like (I added linebreaks for clarity):

python3 /pipeline/Scripts/ga_BWA_generic_v2.py
90
/mydata/indices/chocophlan_h3_chunks/chocophlan_chunk_9.fasta
/mydata/testout/assemble_contigs/final_results/contig_map.tsv
/mydata/testout/GA_BWA/final_results/singletons_45_chocophlan_chunk_9_gene_map.tsv
/mydata/testout/GA_BWA/final_results/singletons_45_chocophlan_chunk_9_mapped_genes.fna
/mydata/testout/GA_BWA/data/0_read_split/singletons/singletons_45.fastq
/mydata/testout/GA_BWA/data/1_bwa/singletons_45_chocophlan_chunk_9.sam
/mydata/testout/GA_BWA/data/2_bwa_pp/singletons_45_chocophlan_chunk_9.fasta
&&
>&2
echo bwa pp complete: singletons_45_chocophlan_chunk_9_bwa_pp |
touch
/mydata/testout/GA_BWA/data/jobs/singletons_45_chocophlan_chunk_9_bwa_pp

Note that there are no references to fasta files in /mydata/testout/GA_BWA/final_results/ in the command, only to .fna and .tsv files.

If there's a cleanup function that's supposed to move them from GA_BWA/data/2_bwa_pp, it's not working.
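
For anyone who wants to confirm the same thing on their own run, a quick way is to count the files in each folder by extension. This is just a diagnostic sketch; `count_extensions` is a made-up helper name, not something in the pipeline.

```python
import os
from collections import Counter


def count_extensions(folder):
    """Count regular files in `folder` by extension (subdirectories are ignored)."""
    counts = Counter()
    for name in os.listdir(folder):
        if os.path.isfile(os.path.join(folder, name)):
            counts[os.path.splitext(name)[1]] += 1
    return counts
```

Running it on /mydata/testout/GA_BWA/final_results and on /mydata/testout/GA_BWA/data/2_bwa_pp shows which folder actually holds the .fasta files that the BLAT step's `endswith(".fasta")` check is looking for.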

billytaj commented 3 years ago

Managed to recreate the problem. The issue was with an if-clause that shouldn't have run with the new chocophlan. It created empty fastas, which caused segfaults in BLAT, but the pipeline still kept running.

Issue patched, but a more robust way to check for failures is something to consider in V2.
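
One possible shape for such a check, sketched under assumptions: screen the fastas for zero-byte files before they are handed to BLAT, so the failure is reported up front instead of surfacing as a segfault. This is not the patch that was applied, and `split_usable_fastas` is a hypothetical name.

```python
import os


def split_usable_fastas(folder, extension=".fasta"):
    """Return (usable, empty) lists of paths for files ending in `extension`.

    A file counts as empty when its size is zero bytes.
    """
    usable, empty = [], []
    for name in sorted(os.listdir(folder)):
        if not name.endswith(extension):
            continue
        path = os.path.join(folder, name)
        if os.path.getsize(path) == 0:
            empty.append(path)
        else:
            usable.append(path)
    return usable, empty
```

A caller could then stop with a clear error, or skip the affected chunks, whenever the `empty` list is non-empty.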