Closed maxcoenen closed 1 year ago
Hi @maxcoenen
Thanks for the bug report. This appears to be an issue with the TEMP2_insertion.sh script and the version of awk on your system. Could you type the following inside your mcclintock environment and report back what you see?
awk --version
cat /etc/os-release
We'll look into the causes of this, but in the meantime you can run McClintock with specific component methods as follows:
python3 mcclintock.py \
-r test/sacCer2.fasta \
-c test/sac_cer_TE_seqs.fasta \
-g test/reference_TE_locations.gff \
-t test/sac_cer_te_families.tsv \
-1 test/SRR800842_1.fastq.gz \
-2 test/SRR800842_2.fastq.gz \
-p 4 \
-m trimgalore,temp,ngs_te_mapper,retroseq \
-o /path/to/output/directory
Just replace trimgalore,temp,ngs_te_mapper,retroseq with the components you want to execute. The full list of components is documented here: https://github.com/bergmanlab/mcclintock/#run.
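If you start from a longer component list and only want to leave out the failing module, the list passed to -m can be trimmed programmatically. This is a minimal sketch, not part of McClintock: drop_method is a hypothetical helper name, and the starting list in the usage note is only an illustration built from the components mentioned in this thread.

```shell
# Hypothetical helper (not part of McClintock): remove one method from a
# comma-separated list such as the one passed to -m.
drop_method() {
    # $1: comma-separated method list, $2: method name to remove
    # -x matches whole entries only, so removing "temp2" leaves "temp" alone;
    # -F treats the name as a fixed string rather than a regular expression.
    printf '%s\n' "$1" | tr ',' '\n' | grep -vxF -- "$2" | paste -sd, -
}
```

For example, drop_method "trimgalore,temp,temp2,ngs_te_mapper,retroseq" temp2 prints trimgalore,temp,ngs_te_mapper,retroseq, which can then be passed to -m in the command above.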
Best regards, Casey
The system runs with awk version 1.3.4 20200120
Ubuntu:
NAME="Ubuntu"
VERSION="20.04.4 LTS (Focal Fossa)"
ID=ubuntu
ID_LIKE=debian
PRETTY_NAME="Ubuntu 20.04.4 LTS"
VERSION_ID="20.04"
HOME_URL="https://www.ubuntu.com/"
SUPPORT_URL="https://help.ubuntu.com/"
BUG_REPORT_URL="https://bugs.launchpad.net/ubuntu/"
PRIVACY_POLICY_URL="https://www.ubuntu.com/legal/terms-and-policies/privacy-policy"
VERSION_CODENAME=focal
UBUNTU_CODENAME=focal
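The version string 1.3.4 20200120 looks like mawk rather than GNU awk, since gawk releases are numbered differently (4.x, 5.x). On Debian and Ubuntu, awk is usually a symlink to one of several implementations, so it can help to check which binary is actually being called. A small sketch, assuming a GNU coreutils readlink that supports -f:

```shell
# Resolve the awk found on PATH to the real binary behind any symlinks.
awk_path="$(command -v awk)"
awk_real="$(readlink -f "$awk_path")"
echo "awk on PATH:  $awk_path"
echo "resolves to:  $awk_real"
```

If the resolved name ends in mawk, scripts written against gawk may behave differently. Whether that is the cause of the TEMP2 failure here is not confirmed in this thread.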
I was able to run McClintock successfully now. I also patched out fastq_info by replacing the function with true (returning 0), since I know my paired reads are fine and fastq_info used too much memory for the script to complete.
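For anyone wanting to reproduce a similar workaround without editing the McClintock sources, one option is a stub executable placed ahead of the real one on PATH. This is only a sketch of that idea, not necessarily how the patch above was done: it assumes fastq_info is invoked by name through PATH, and it disables read validation entirely, so it only makes sense when the paired reads are already known to be consistent.

```shell
# Create a private directory holding a stub named fastq_info that always succeeds.
shim_dir="$(mktemp -d)"
printf '#!/bin/sh\n# Stub: skip validation and report success.\nexit 0\n' > "$shim_dir/fastq_info"
chmod +x "$shim_dir/fastq_info"

# Put the stub ahead of the real binary for this shell session only.
PATH="$shim_dir:$PATH"
export PATH
```

Because PATH is changed only in the current shell, opening a new shell restores the real fastq_info.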
I really appreciate the way McClintock handles the post-processing, especially separating the reference and non-reference TEs. That is why I am still running McClintock, though just with the PoPoolationTE2 pipeline. I was wondering: when running McClintock with multiple samples, could the pre-processing of the reference genome (masking) be done only once, or could the masked reference be provided as a McClintock argument (similar to the -s coverage fasta flag)? That would save time when running multiple samples.
Hi @maxcoenen
Sorry this slipped. Yes, you can re-use a masked reference genome produced in a previous McClintock run, as described here: https://github.com/bergmanlab/mcclintock#running-mcclintock-with-multiple-samples-using-same-reference-genome.
Hope this helps, Casey
I installed McClintock and all dependencies in a conda environment (on the cloud; Databricks) and tried the pipeline with the test dataset acquired from download_test_data.py. I got the following error:

When investigating further, I tried to locate the .../test_mcclintock_driver/snakemake/6425337/.snakemake/scripts/tmpz0b0oixt.temp2_post.py script, yet did not find it; I assume it is a temporary generated Python script. Yet when I investigated .../test_mcclintock_driver/logs/20220819.110033.6425337/temp2.log I found the following:

So it seems that SRR800842_1.unproper.uniq.interval.bed has been formatted incorrectly, possibly by not having a valid value in the 5th column, I suppose. However, this file is temporary, and seems to have been deleted by the script after the error occurred, so I was not able to investigate it.

I was hoping that you could help me get McClintock running. Perhaps there is a way to still run it without temp2, or to solve the issues in that specific module. Thank you!
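If a copy of the intermediate BED file can be kept before it is cleaned up, the suspicion about the 5th column can be checked directly. The following is a rough sketch under stated assumptions: check_bed_col5 is a hypothetical helper name, the file is assumed to be tab-separated, and a "valid" 5th column is assumed to mean a plain integer or decimal number, which may not match what temp2 actually expects.

```shell
# Hypothetical helper: report rows whose 5th tab-separated field is
# missing or is not a plain integer or decimal number.
check_bed_col5() {
    # $1: path to a tab-separated BED file
    # Prints "<line number>: <line>" for each suspicious row and
    # returns a non-zero status if any were found.
    awk -F'\t' '
        NF < 5 || $5 !~ /^-?[0-9]+(\.[0-9]+)?$/ { print NR ": " $0; bad = 1 }
        END { exit bad }
    ' "$1"
}
```

Running check_bed_col5 on a saved copy of SRR800842_1.unproper.uniq.interval.bed would list any rows that fail this check. The pattern avoids interval expressions such as {2}, which are not handled the same way by every awk implementation.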