Error running mcclintock.py with testdata from download_test_data.py

maxcoenen commented 2 years ago

I installed McClintock and all its dependencies in a conda environment (in the cloud, on Databricks) and ran the pipeline on the test dataset downloaded by download_test_data.py. I got the following error:

/databricks/driver/mcclintock/test/SRR800842_1.fastq.gz already exists...skipping...
/databricks/driver/mcclintock/test/SRR800842_2.fastq.gz already exists...skipping...
SETUP            checking fasta: /databricks/driver/mcclintock/test/sacCer2.fasta
SETUP            checking fastq: /databricks/driver/mcclintock/test/SRR800842_1.fastq.gz
SETUP            checking fastq: /databricks/driver/mcclintock/test/SRR800842_2.fastq.gz
SETUP            checking fasta: /databricks/driver/mcclintock/test/sac_cer_TE_seqs.fasta
SETUP            checking locations gff: /databricks/driver/mcclintock/test/reference_TE_locations.gff
SETUP            checking taxonomy TSV: /databricks/driver/mcclintock/test/sac_cer_te_families.tsv
SETUP            McClintock Version: d2b819a18b2a549be483fdcc948e1346e589a4cb
PROCESSING       making coverage fasta
PROCESSING       coverage fasta created
PROCESSING       making consensus fasta
PROCESSING       consensus fasta created
PROCESSING       making reference fasta
PROCESSING       reference fasta created
PROCESSING       formatting the name of consensus TE fasta headers for compatibility with relocaTE
PROCESSING       relocaTE consensus fasta created
PROCESSING       making reference TE annotations
PROCESSING       reference TE annotations created
PROCESSING       creating 2bit file from reference genome fasta &> /databricks/driver/mcclintock/test_mcclintock_driver/logs/20220819.110033.6425337/processing.log
PROCESSING       reference 2bit file created
PROCESSING       adding fake chromosomes if chrom # < 5, required by TE-locate
PROCESSING       TE-locate reference created
PROCESSING       making popoolationTE annotation files
PROCESSING       popoolationTE annotation files created
PROCESSING       creating relocaTE reference TE gff
PROCESSING       relocaTE reference TE gff created
PROCESSING       masking reference fasta &> /databricks/driver/mcclintock/test_mcclintock_driver/logs/20220819.110033.6425337/processing.log
PROCESSING       reference fasta masked
PROCESSING       making reference TE bed file
PROCESSING       reference TE bed file created
PROCESSING       making samtools and bwa index files for reference fasta &> /databricks/driver/mcclintock/test_mcclintock_driver/logs/20220819.110033.6425337/processing.log
PROCESSING       samtools and bwa index files for reference fasta created
PROCESSING       making reference TE fasta &> /databricks/driver/mcclintock/test_mcclintock_driver/logs/20220819.110033.6425337/processing.log
PROCESSING       reference TE fasta created
PROCESSING       making TE-locate taxonomy file &> /databricks/driver/mcclintock/test_mcclintock_driver/logs/20220819.110033.6425337/processing.log
PROCESSING       TE-locate taxonomy file created
PROCESSING       making PopoolationTE reference fasta
PROCESSING       PopoolationTE reference fasta created
PROCESSING       Running RepeatMasker &> /databricks/driver/mcclintock/test_mcclintock_driver/logs/20220819.110033.6425337/processing.log
PROCESSING       Repeatmasker complete
PROCESSING       creating ngs_te_mapper2 reference TE gff
PROCESSING       ngs_te_mapper2 reference TE gff created
Searching 8 files..
PROCESSING       prepping reads for McClintock
PROCESSING       running trim_galore &> /databricks/driver/mcclintock/test_mcclintock_driver/logs/20220819.110033.6425337/trimgalore.log
PROCESSING       read setup complete
PROCESSING       mapping reads to reference &> /databricks/driver/mcclintock/test_mcclintock_driver/logs/20220819.110033.6425337/bwa.log
PROCESSING       read mapping complete
PROCESSING       sorting SAM file for compatibility with TE-locate &> /databricks/driver/mcclintock/test_mcclintock_driver/logs/20220819.110033.6425337/processing.log
PROCESSING       TE-locate SAM created
RELOCATE         running RelocaTE &> /databricks/driver/mcclintock/test_mcclintock_driver/logs/20220819.110033.6425337/relocaTE.log
RELOCATE         RelocaTE run complete
RELOCATE         processing RelocaTE results
RELOCATE         RelocaTE postprocessing complete
TEFLON           setting up for TEFLoN
TEFLON           setup for TEFLoN complete
PROCESSING       Converting sam to bam &> /databricks/driver/mcclintock/test_mcclintock_driver/logs/20220819.110033.6425337/processing.log
PROCESSING       sam to bam converted
PROCESSING       calculating median insert size of reads
PROCESSING       median insert size of reads calculated
TE-LOCATE        running TE-Locate &> /databricks/driver/mcclintock/test_mcclintock_driver/logs/20220819.110033.6425337/te-locate.log
TE-LOCATE        TE-Locate complete
TE-LOCATE        processing TE-Locate results
TE-LOCATE        TE-Locate post processing complete
RETROSEQ         running RetroSeq &> /databricks/driver/mcclintock/test_mcclintock_driver/logs/20220819.110033.6425337/retroseq.log
RETROSEQ         RetroSeq complete
RETROSEQ         processing RetroSeq results
RETROSEQ         RetroSeq post processing complete
TEBREAK          running tebreak
TEBREAK          tebreak run complete
TEBREAK          running tebreak post processing
TEBREAK          tebreak postprocessing complete
TEMP2            running TEMP2 Module
TEMP2            running TEMP2 non-reference insertion prediction &> /databricks/driver/mcclintock/test_mcclintock_driver/logs/20220819.110033.6425337/temp2.log
TEMP2            running TEMP2 non-reference absence prediction &> /databricks/driver/mcclintock/test_mcclintock_driver/logs/20220819.110033.6425337/temp2.log
TEMP2            TEMP2 run complete
TEMP2            running TEMP2 post processing
POPOOLATIONTE    running PopoolationTE preprocessing steps
POPOOLATIONTE    formatting read names
POPOOLATIONTE    indexing popoolationTE reference fasta &> /databricks/driver/mcclintock/test_mcclintock_driver/logs/20220819.110033.6425337/popoolationTE.log
POPOOLATIONTE    mapping fastq1 reads &> /databricks/driver/mcclintock/test_mcclintock_driver/logs/20220819.110033.6425337/popoolationTE.log
POPOOLATIONTE    mapping fastq2 reads &> /databricks/driver/mcclintock/test_mcclintock_driver/logs/20220819.110033.6425337/popoolationTE.log
POPOOLATIONTE    combining alignments &> /databricks/driver/mcclintock/test_mcclintock_driver/logs/20220819.110033.6425337/popoolationTE.log
POPOOLATIONTE    sorting sam file &> /databricks/driver/mcclintock/test_mcclintock_driver/logs/20220819.110033.6425337/popoolationTE.log
POPOOLATIONTE    PopoolationTE preprocessing complete
COVERAGE         Running RepeatMasker &> /databricks/driver/mcclintock/test_mcclintock_driver/logs/20220819.110033.6425337/coverage.log
COVERAGE         augmenting reference genome
COVERAGE         samtools and bwa indexing reference &> /databricks/driver/mcclintock/test_mcclintock_driver/logs/20220819.110033.6425337/coverage.log
COVERAGE         samtools and bwa indexing reference &> /databricks/driver/mcclintock/test_mcclintock_driver/logs/20220819.110033.6425337/coverage.log
COVERAGE         mapping reads to augmented reference genome &> /databricks/driver/mcclintock/test_mcclintock_driver/logs/20220819.110033.6425337/coverage.log
COVERAGE         converting SAM to BAM, and indexing &> /databricks/driver/mcclintock/test_mcclintock_driver/logs/20220819.110033.6425337/coverage.log
COVERAGE         creating BED file of non-TE regions &> /databricks/driver/mcclintock/test_mcclintock_driver/logs/20220819.110033.6425337/coverage.log
COVERAGE         converting repeatmasker GFF to BED
COVERAGE         determining the coverage depth of the genome &> /databricks/driver/mcclintock/test_mcclintock_driver/logs/20220819.110033.6425337/coverage.log
COVERAGE         creating TE depth coverage table &> /databricks/driver/mcclintock/test_mcclintock_driver/logs/20220819.110033.6425337/coverage.log
COVERAGE         creating TE coverage plots
COVERAGE         plot created: /databricks/driver/mcclintock/test_mcclintock_driver/SRR800842_1//results/coverage/plots/TY1.png
COVERAGE         plot created: /databricks/driver/mcclintock/test_mcclintock_driver/SRR800842_1//results/coverage/plots/TY2.png
COVERAGE         plot created: /databricks/driver/mcclintock/test_mcclintock_driver/SRR800842_1//results/coverage/plots/TY3.png
COVERAGE         plot created: /databricks/driver/mcclintock/test_mcclintock_driver/SRR800842_1//results/coverage/plots/TY3_1p.png
COVERAGE         plot created: /databricks/driver/mcclintock/test_mcclintock_driver/SRR800842_1//results/coverage/plots/TY4.png
COVERAGE         plot created: /databricks/driver/mcclintock/test_mcclintock_driver/SRR800842_1//results/coverage/plots/TY5.png
Job counts:
    count   jobs
    1   coverage
    1   index_reference_genome
    1   make_consensus_fasta
    1   make_coverage_fasta
    1   make_popoolationte_annotations
    1   make_ref_te_bed
    1   make_ref_te_fasta
    1   make_reference_fasta
    1   make_te_annotations
    1   map_reads
    1   mask_reference_fasta
    1   median_insert_size
    1   ngs_te_mapper2_post
    1   ngs_te_mapper2_pre
    1   ngs_te_mapper2_run
    1   ngs_te_mapper_post
    1   ngs_te_mapper_run
    1   popoolationTE2_post
    1   popoolationTE2_preprocessing
    1   popoolationTE2_run
    1   popoolationTE_post
    1   popoolationTE_preprocessing
    1   popoolationTE_ref_fasta
    1   popoolationTE_run
    1   process_temp
    1   process_temp2
    1   reference_2bit
    1   relocaTE2_post
    1   relocaTE2_run
    1   relocaTE_consensus
    1   relocaTE_post
    1   relocaTE_ref_gff
    1   relocaTE_run
    1   repeatmask
    1   retroseq_post
    1   retroseq_run
    1   run_temp
    1   run_temp2
    1   sam_to_bam
    1   setup_reads
    1   summary_report
    1   tebreak_post
    1   tebreak_run
    1   teflon_post
    1   teflon_preprocessing
    1   teflon_run
    1   telocate_post
    1   telocate_ref
    1   telocate_run
    1   telocate_sam
    1   telocate_taxonomy
    51
[WARNING]         multiqc : MultiQC Version v1.12 now available!
[INFO   ]         multiqc : This is MultiQC v1.9 (d2b819a)
[INFO   ]         multiqc : Template    : default
[INFO   ]         multiqc : Searching   : /databricks/driver/mcclintock/test_mcclintock_driver/SRR800842_1/results/trimgalore
[INFO   ]        cutadapt : Found 2 reports
[INFO   ]          fastqc : Found 2 reports
[INFO   ]         multiqc : Compressing plot data
[INFO   ]         multiqc : Report      : multiqc_report.html
[INFO   ]         multiqc : Data        : multiqc_data
[INFO   ]         multiqc : MultiQC complete
Environment defines Python version < 3.5. Using Python of the master process to execute script. Note that this cannot be avoided, because the script uses data structures from Snakemake which are Python >=3.5 only.
Traceback (most recent call last):
  File "/databricks/driver/mcclintock/test_mcclintock_driver/snakemake/6425337/.snakemake/scripts/tmpz0b0oixt.temp2_post.py", line 140, in <module>
    main()
  File "/databricks/driver/mcclintock/test_mcclintock_driver/snakemake/6425337/.snakemake/scripts/tmpz0b0oixt.temp2_post.py", line 33, in main
    insertions = read_insertions(insert_bed, sample_name, chromosomes, config)
  File "/databricks/driver/mcclintock/test_mcclintock_driver/snakemake/6425337/.snakemake/scripts/tmpz0b0oixt.temp2_post.py", line 61, in read_insertions
    insert.start = int(split_line[1])+1
ValueError: invalid literal for int() with base 10: 'Start'
[Fri Aug 19 12:59:45 2022]
Error in rule process_temp2:
    jobid: 41
    output: /databricks/driver/mcclintock/test_mcclintock_driver/SRR800842_1/results/temp2/SRR800842_1_temp2_nonredundant.bed
    conda-env: /databricks/driver/mcclintock/install/envs/conda/c5e711caeae2345377b69589ff64a24a

RuleException:
CalledProcessError in line 56 of /databricks/driver/mcclintock/snakefiles/temp2.snakefile:
Command 'source /databricks/conda/bin/activate '/databricks/driver/mcclintock/install/envs/conda/c5e711caeae2345377b69589ff64a24a'; set -euo pipefail;  python /databricks/driver/mcclintock/test_mcclintock_driver/snakemake/6425337/.snakemake/scripts/tmpz0b0oixt.temp2_post.py' returned non-zero exit status 1.
  File "/databricks/conda/envs/mcclintock/lib/python3.10/site-packages/snakemake/executors/__init__.py", line 2326, in run_wrapper
  File "/databricks/driver/mcclintock/snakefiles/temp2.snakefile", line 56, in __rule_process_temp2
  File "/databricks/conda/envs/mcclintock/lib/python3.10/site-packages/snakemake/executors/__init__.py", line 568, in _callback
  File "/databricks/conda/envs/mcclintock/lib/python3.10/concurrent/futures/thread.py", line 58, in run
  File "/databricks/conda/envs/mcclintock/lib/python3.10/site-packages/snakemake/executors/__init__.py", line 554, in cached_or_run
  File "/databricks/conda/envs/mcclintock/lib/python3.10/site-packages/snakemake/executors/__init__.py", line 2357, in run_wrapper
Exiting because a job execution failed. Look above for error message
snakemake --use-conda --conda-prefix /databricks/driver/mcclintock/install/envs/conda --configfile /databricks/driver/mcclintock/test_mcclintock_driver/snakemake/config/config_6425337.json --cores 8 --quiet /databricks/driver/mcclintock/test_mcclintock_driver/SRR800842_1/results/ngs_te_mapper/SRR800842_1_ngs_te_mapper_nonredundant.bed /databricks/driver/mcclintock/test_mcclintock_driver/SRR800842_1/results/ngs_te_mapper2/SRR800842_1_ngs_te_mapper2_nonredundant.bed /databricks/driver/mcclintock/test_mcclintock_driver/SRR800842_1/results/relocate/SRR800842_1_relocate_nonredundant.bed /databricks/driver/mcclintock/test_mcclintock_driver/SRR800842_1/results/relocate2/SRR800842_1_relocate2_nonredundant.bed /databricks/driver/mcclintock/test_mcclintock_driver/SRR800842_1/results/temp/SRR800842_1_temp_nonredundant.bed /databricks/driver/mcclintock/test_mcclintock_driver/SRR800842_1/results/temp2/SRR800842_1_temp2_nonredundant.bed /databricks/driver/mcclintock/test_mcclintock_driver/SRR800842_1/results/retroseq/SRR800842_1_retroseq_nonredundant.bed /databricks/driver/mcclintock/test_mcclintock_driver/SRR800842_1/results/popoolationte/SRR800842_1_popoolationte_nonredundant.bed /databricks/driver/mcclintock/test_mcclintock_driver/SRR800842_1/results/popoolationte2/SRR800842_1_popoolationte2_nonredundant.bed /databricks/driver/mcclintock/test_mcclintock_driver/SRR800842_1/results/te-locate/SRR800842_1_telocate_nonredundant.bed /databricks/driver/mcclintock/test_mcclintock_driver/SRR800842_1/results/teflon/SRR800842_1_teflon_nonredundant.bed /databricks/driver/mcclintock/test_mcclintock_driver/SRR800842_1/results/coverage/te_depth.csv /databricks/driver/mcclintock/test_mcclintock_driver/SRR800842_1/intermediate/fastq/SRR800842_1_1.fq /databricks/driver/mcclintock/test_mcclintock_driver/SRR800842_1/intermediate/mapped_reads/median_insert.size /databricks/driver/mcclintock/test_mcclintock_driver/SRR800842_1/results/tebreak/SRR800842_1_tebreak_nonredundant.bed 
/databricks/driver/mcclintock/test_mcclintock_driver/SRR800842_1/results/summary/data/run/summary_report.txt
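
The ValueError at the end of the traceback above says that int() received the string 'Start', which suggests the post-processing step read a header row where it expected a coordinate. A minimal sketch of a more defensive reader follows, assuming tab-separated rows with the start coordinate in the second field. The helper name read_rows and the header text in the example are hypothetical, and this is not McClintock's own read_insertions().

```python
def read_rows(lines):
    """Yield (chrom, start, end) from tab-separated rows, skipping non-numeric ones.

    Hypothetical helper for illustration; not McClintock's read_insertions().
    """
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 3:
            continue  # blank or truncated line
        try:
            start = int(fields[1]) + 1  # same +1 shift as in the traceback
            end = int(fields[2])
        except ValueError:
            continue  # header row (e.g. second field "Start") or malformed data
        yield fields[0], start, end


# Assumed example input: a header row followed by one data row.
example = ["Chr\tStart\tEnd\n", "chrI\t100\t105\n"]
rows = list(read_rows(example))
print(rows)
```

Skipping rows silently would only hide the symptom here. A file that contains nothing but a header usually means the upstream tool produced no results, so the tool's own log is the place to look.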

When investigating further, I tried to locate the .../test_mcclintock_driver/snakemake/6425337/.snakemake/scripts/tmpz0b0oixt.temp2_post.py script but could not find it; I assume it is a temporary, generated Python script. However, when I looked at .../test_mcclintock_driver/logs/20220819.110033.6425337/temp2.log, I found the following:

2bit: /databricks/driver/mcclintock/test_mcclintock_driver/SRR800842_1/intermediate/genome_fasta/sacCer2.aug.fasta.2bit
consensus fasta: /databricks/driver/mcclintock/test_mcclintock_driver/sacCer2/consensus_fasta/consensusTEs.fasta
reference TE BED: /databricks/driver/mcclintock/test_mcclintock_driver/SRR800842_1/intermediate/reference_te_locations/sacCer2.ref.TEs.bed
Taxonomy TSV: /databricks/driver/mcclintock/test_mcclintock_driver/sacCer2/te_taxonomy/taxonomy.tsv
Testing required softwares:
bwa: /databricks/driver/mcclintock/install/envs/conda/93485968c87cc0cd0ea79dc3c191d747/bin/bwa
samtools: /databricks/driver/mcclintock/install/envs/conda/93485968c87cc0cd0ea79dc3c191d747/bin/samtools
bedtools: /databricks/driver/mcclintock/install/envs/conda/93485968c87cc0cd0ea79dc3c191d747/bin/bedtools
bedops: /databricks/driver/mcclintock/install/envs/conda/93485968c87cc0cd0ea79dc3c191d747/bin/bedops
------ Start pipeline ------
get concordant-uniq-split reads Thu Aug 18 12:46:18 UTC 2022
awk: line 2: function and never defined
awk: line 2: function and never defined
[M::bam2fq_mainloop] discarded 0 singletons
[M::bam2fq_mainloop] processed 0 reads
check fragment length   Thu Aug 18 12:47:50 UTC 2022
awk: line 2: function and never defined
Error in read.table(Args[6], header = F, row.names = NULL) : 
  no lines available in input
Execution halted
/databricks/driver/mcclintock/install/tools/temp2/bin/TEMP2_insertion.sh: line 170: [: -gt: unary operator expected
get mate seq of the uniq-unpaired   Thu Aug 18 12:47:50 UTC 2022
awk: line 2: function and never defined
[main_samview] fail to read the header from "-".
samtools sort: failed to read header from "-"
[E::hts_open_format] Failed to open file "SRR800842_1.unpair.uniq.sortByName.bam" : No such file or directory
samtools bam2fq: Cannot read file "SRR800842_1.unpair.uniq.sortByName.bam": No such file or directory
[main_samview] fail to read the header from "-".
map paired split uniqMappers and unpaired uniqMappers to transposons    Thu Aug 18 12:48:01 UTC 2022
[main_samview] fail to read the header from "SRR800842_1.unpair.uniq.transposon.sam".
merge fragments in genome and transposon    Thu Aug 18 12:48:05 UTC 2022
awk: line 40: function and never defined
awk: line 40: function and never defined
awk: line 78: function and never defined
awk: line 78: function and never defined
awk: line 78: function and never defined
SRR800842_1.t is empty
merge support reads in the same direction within 302 -  Thu Aug 18 12:48:05 UTC 2022
expr: syntax error: missing argument after ‘-’
/databricks/driver/mcclintock/install/tools/temp2/bin/TEMP2_insertion.sh: line 206: [: -lt: unary operator expected
sort: cannot read: 'SRR800842_1.supportReads/*.bed': No such file or directory

*****
***** ERROR: Requested column 4, but database file - only has fields 1 - 0.
Traceback (most recent call last):
  File "/databricks/driver/mcclintock/install/tools/temp2/bin/processMergedBed.py", line 108, in <module>
    main()
  File "/databricks/driver/mcclintock/install/tools/temp2/bin/processMergedBed.py", line 11, in main
    ins = int(sys.argv[3])
IndexError: list index out of range
merge support reads in different direction within 2 X 302 -     Thu Aug 18 12:48:05 UTC 2022
filter candidate insertions which overlap with the same transposon insertion or in high depth region    Thu Aug 18 12:48:05 UTC 2022
filter candidate insertions in high depth region    Thu Aug 18 12:48:05 UTC 2022
average read number for 200bp bins is 0, set read number cutoff to 0
***** WARNING: File - has inconsistent naming convention for record:
2micron 0   99  SRR800842.86499/1   60  +

***** WARNING: File - has inconsistent naming convention for record:
2micron 0   99  SRR800842.86499/1   60  +

Filtered insertion number: 0 - 0 (overlap rmsk) 0 (short insertion) - 0 (high depth) = 0
generate the overall distribution of transposon mapping reads, first map all reads to transposon    Thu Aug 18 12:50:15 UTC 2022
sam to bed and bedGraph, multiple mappers are divided by their map times    Thu Aug 18 12:58:14 UTC 2022
[bam_sort_core] merging from 0 files and 3 in-memory blocks...
estimate de novo insertion number for each transposon using singleton reads Thu Aug 18 12:58:35 UTC 2022
Error in `[.data.frame`(soma, , 4) : undefined columns selected
Calls: [ -> [.data.frame
Execution halted
generate distribution figures for singleton supporting reads    Thu Aug 18 12:58:35 UTC 2022
Error in read.table(Args[6], header = F, row.names = NULL) : 
  no lines available in input
Execution halted
filter unreliable singleton insertions, also filter 2p insertions overlapped with similar reference transposon copies   Thu Aug 18 12:58:36 UTC 2022
Error: unable to open file or unable to determine types for file SRR800842_1.tmp

- Please ensure that your file is TAB delimited (e.g., cat -t FILE).
- Also ensure that your file has integer chromosome coordinates in the 
  expected columns (e.g., cols 2 and 3 for BED).
Error: unable to open file or unable to determine types for file SRR800842_1.tmp

- Please ensure that your file is TAB delimited (e.g., cat -t FILE).
- Also ensure that your file has integer chromosome coordinates in the 
  expected columns (e.g., cols 2 and 3 for BED).
Calculate frequency of each transposon insertion    Thu Aug 18 12:58:36 UTC 2022
[bam_sort_core] merging from 0 files and 3 in-memory blocks...
Error: unable to open file or unable to determine types for file SRR800842_1.tmp

- Please ensure that your file is TAB delimited (e.g., cat -t FILE).
- Also ensure that your file has integer chromosome coordinates in the 
  expected columns (e.g., cols 2 and 3 for BED).
get TSD, remove redundant insertions and recalculate de novo insertion rate Thu Aug 18 12:58:36 UTC 2022
awk: line 6: regular expression compile failed (missing operand)
|

*****
***** ERROR: Requested column 2, but database file - only has fields 1 - 0.
Expecting number field 3 line 3 of SRR800842_1.t, got ;;
calculate de novo insertion rate per genome Thu Aug 18 12:58:36 UTC 2022
clean tmp files Thu Aug 18 12:58:36 UTC 2022
Done, Congras!!!🍺🍺🍺
bash /databricks/driver/mcclintock/install/tools/temp2//TEMP2 insertion -l /databricks/driver/mcclintock/test_mcclintock_driver/SRR800842_1/intermediate/fastq/SRR800842_1_1.fq -r /databricks/driver/mcclintock/test_mcclintock_driver/SRR800842_1/intermediate/fastq/SRR800842_1_2.fq -i /databricks/driver/mcclintock/test_mcclintock_driver/SRR800842_1/intermediate/mapped_reads/SRR800842_1.sorted.bam -g /databricks/driver/mcclintock/test_mcclintock_driver/sacCer2/genome_fasta/sacCer2.fasta -R /databricks/driver/mcclintock/test_mcclintock_driver/sacCer2/consensus_fasta/consensusTEs.fasta -t /databricks/driver/mcclintock/test_mcclintock_driver/SRR800842_1/intermediate/reference_te_locations/sacCer2.ref.TEs.bed -c 3 -f 302 -o /databricks/driver/mcclintock/test_mcclintock_driver/SRR800842_1/results/temp2/unfiltered/ -M 2 -m 5 -U 0.8 -N 300
Testing required softwares/scripts:
"echo" is using: /usr/bin/echo
"rm" is using: /usr/bin/rm
"mkdir" is using: /usr/bin/mkdir
"date" is using: /usr/bin/date
"mv" is using: /usr/bin/mv
"sort" is using: /usr/bin/sort
"touch" is using: /usr/bin/touch
"awk" is using: /usr/bin/awk
"grep" is using: /usr/bin/grep
"twoBitToFa" is using: /databricks/driver/mcclintock/install/envs/conda/93485968c87cc0cd0ea79dc3c191d747/bin/twoBitToFa
"bwa" is using: /databricks/driver/mcclintock/install/envs/conda/93485968c87cc0cd0ea79dc3c191d747/bin/bwa
"samtools" is using: /databricks/driver/mcclintock/install/envs/conda/93485968c87cc0cd0ea79dc3c191d747/bin/samtools
Done with testing required softwares/scripts, starting pipeline...
SRR800842_1.sorted.bam
SRR800842_1
***** WARNING: File SRR800842_1.unproper.uniq.interval.bed has inconsistent naming convention for record:
2micron 38  217 SRR800842.3973792/1     0   +

***** WARNING: File SRR800842_1.unproper.uniq.interval.bed has inconsistent naming convention for record:
2micron 38  217 SRR800842.3973792/1     0   +

bash /databricks/driver/mcclintock/install/tools/temp2//TEMP2 absence -i /databricks/driver/mcclintock/test_mcclintock_driver/SRR800842_1/intermediate/mapped_reads/SRR800842_1.sorted.bam -r /databricks/driver/mcclintock/test_mcclintock_driver/SRR800842_1/intermediate/reference_te_locations/sacCer2.ref.TEs.bed -t /databricks/driver/mcclintock/test_mcclintock_driver/SRR800842_1/intermediate/genome_fasta/sacCer2.aug.fasta.2bit -f 302 -c 3 -o /databricks/driver/mcclintock/test_mcclintock_driver/SRR800842_1/results/temp2/unfiltered//absence -x {'-x': 0}
cp /databricks/driver/mcclintock/test_mcclintock_driver/SRR800842_1/results/temp2/unfiltered//absence/SRR800842_1.absence.refined.bp.summary /databricks/driver/mcclintock/test_mcclintock_driver/SRR800842_1/results/temp2/unfiltered/

So it seems that SRR800842_1.unproper.uniq.interval.bed was formatted incorrectly, possibly because the 5th column does not hold a valid value. However, this file is temporary and seems to have been deleted by the script after the error occurred, so I could not inspect it.

I was hoping you could help me get McClintock running. Perhaps there is a way to run it without temp2, or to fix the issues in that specific module. Thank you!

cbergman commented 2 years ago

Hi @maxcoenen

Thanks for the bug report. This appears to be an issue with the TEMP2_insertion.sh script and the version of awk on your system. Could you type the following inside your mcclintock environment and report back what you see?

awk --version
cat /etc/os-release
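
The same information can be gathered from Python. The sketch below rests on two assumptions: that the "function and never defined" messages in temp2.log come from an awk without the gawk-style and() bitwise function, and that a one-line BEGIN program is a fair probe for it. The function names are hypothetical helpers, not part of McClintock or TEMP2.

```python
import os
import shutil
import subprocess


def parse_os_release(text):
    """Parse KEY=VALUE lines of /etc/os-release into a dict, stripping quotes."""
    info = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        info[key] = value.strip().strip('"')
    return info


def resolve_awk(name="awk"):
    """Return the real path behind `awk` on PATH, or None if it is not found."""
    path = shutil.which(name)
    return os.path.realpath(path) if path else None


def awk_supports_and(awk="awk"):
    """Return True if this awk accepts the gawk-style and() bitwise function."""
    try:
        result = subprocess.run(
            [awk, "BEGIN { print and(5, 1) }"],
            stdin=subprocess.DEVNULL,
            capture_output=True,
            text=True,
            timeout=10,
        )
    except (OSError, subprocess.TimeoutExpired):
        return False  # binary missing or not runnable
    return result.returncode == 0 and result.stdout.strip() == "1"


sample = 'NAME="Ubuntu"\nVERSION_ID="20.04"\nID=ubuntu\n'
print(parse_os_release(sample))
print("awk resolves to:", resolve_awk())
print("and() available:", awk_supports_and())
```

If resolve_awk() points at a different implementation than the one the pipeline was developed against, that would be consistent with the awk errors in the log, though only the maintainers can confirm which implementation TEMP2 expects.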

We'll look into the causes of this, but in the meantime you can run McClintock with specific component methods as follows:

python3 mcclintock.py \
    -r test/sacCer2.fasta \
    -c test/sac_cer_TE_seqs.fasta \
    -g test/reference_TE_locations.gff \
    -t test/sac_cer_te_families.tsv \
    -1 test/SRR800842_1.fastq.gz \
    -2 test/SRR800842_2.fastq.gz \
    -p 4 \
    -m trimgalore,temp,ngs_te_mapper,retroseq \
    -o /path/to/output/directory

Just replace trimgalore,temp,ngs_te_mapper,retroseq with the components you want to execute. The full list of components is documented here: https://github.com/bergmanlab/mcclintock/#run.
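
For scripted runs, the command above can be assembled from a list of methods so that one component is dropped without retyping the rest. This is a sketch only: the flags are the ones shown in the command above, while the method names in all_methods are an assumption taken from the results directory names in the log and should be checked against the linked documentation.

```python
import shlex


def build_command(methods, out_dir, threads=4):
    """Assemble the mcclintock.py call shown above for the chosen methods."""
    return [
        "python3", "mcclintock.py",
        "-r", "test/sacCer2.fasta",
        "-c", "test/sac_cer_TE_seqs.fasta",
        "-g", "test/reference_TE_locations.gff",
        "-t", "test/sac_cer_te_families.tsv",
        "-1", "test/SRR800842_1.fastq.gz",
        "-2", "test/SRR800842_2.fastq.gz",
        "-p", str(threads),
        "-m", ",".join(methods),
        "-o", out_dir,
    ]


# Assumed method names, inferred from the results directories in the log.
all_methods = [
    "trimgalore", "ngs_te_mapper", "ngs_te_mapper2", "relocate", "relocate2",
    "temp", "temp2", "retroseq", "popoolationte", "popoolationte2",
    "te-locate", "teflon", "tebreak", "coverage",
]

# Drop only the failing component.
methods = [m for m in all_methods if m != "temp2"]
cmd = build_command(methods, "/path/to/output/directory")
print(shlex.join(cmd))
```

The list can be passed to subprocess.run(cmd, check=True) from the McClintock directory once the method names have been verified.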

Best regards, Casey

maxcoenen commented 2 years ago

The system runs with awk version 1.3.4 20200120

Ubuntu:

NAME="Ubuntu"
VERSION="20.04.4 LTS (Focal Fossa)"
ID=ubuntu
ID_LIKE=debian
PRETTY_NAME="Ubuntu 20.04.4 LTS"
VERSION_ID="20.04"
HOME_URL="https://www.ubuntu.com/"
SUPPORT_URL="https://help.ubuntu.com/"
BUG_REPORT_URL="https://bugs.launchpad.net/ubuntu/"
PRIVACY_POLICY_URL="https://www.ubuntu.com/legal/terms-and-policies/privacy-policy"
VERSION_CODENAME=focal
UBUNTU_CODENAME=focal

I was able to run McClintock successfully now. I also patched out fastq_info by replacing the function with true (returning 0), since I know my paired reads are fine and fastq_info used too much memory for the script to complete.

maxcoenen commented 2 years ago

I really appreciate the way McClintock handles post-processing, especially the separation of reference and non-reference TEs. That is why I am still running McClintock, though only with the PoPoolationTE2 pipeline. When running McClintock with multiple samples, could the pre-processing of the reference genome (masking) be done only once, or could the masked genome be supplied as a McClintock argument (similar to the -s coverage fasta flag)? That would save time when running multiple samples.

cbergman commented 1 year ago

Hi @maxcoenen

Sorry this slipped. Yes, you can reuse a masked reference genome produced in a previous McClintock run, as described here: https://github.com/bergmanlab/mcclintock#running-mcclintock-with-multiple-samples-using-same-reference-genome.

Hope this helps, Casey
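
As a rough illustration of a multi-sample run, the sketch below builds one command per sample, all sharing the same reference inputs and the same -o directory. It assumes that a shared output directory is what allows reference files to be reused, based on the log above, where reference files sit under <output>/sacCer2/ and sample files under <output>/SRR800842_1/. The linked README section is the authority on the exact procedure. The sample paths, the function name and the method name popoolationte2 are hypothetical or inferred from the log.

```python
def sample_commands(fastq_pairs, out_dir, methods=("popoolationte2",), threads=4):
    """Build one mcclintock.py command per sample with shared reference inputs."""
    commands = []
    for fq1, fq2 in fastq_pairs:
        commands.append([
            "python3", "mcclintock.py",
            "-r", "test/sacCer2.fasta",
            "-c", "test/sac_cer_TE_seqs.fasta",
            "-g", "test/reference_TE_locations.gff",
            "-t", "test/sac_cer_te_families.tsv",
            "-1", fq1,
            "-2", fq2,
            "-p", str(threads),
            "-m", ",".join(methods),
            "-o", out_dir,  # same directory for every sample (assumption)
        ])
    return commands


# Hypothetical sample files for illustration.
pairs = [
    ("reads/sampleA_1.fastq.gz", "reads/sampleA_2.fastq.gz"),
    ("reads/sampleB_1.fastq.gz", "reads/sampleB_2.fastq.gz"),
]
cmds = sample_commands(pairs, "/path/to/output/directory")
for c in cmds:
    print(" ".join(c))
```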
