Errors about running straglr to call tandem repeats #5

LiShuhang-gif commented 2 years ago

Hi, I'm trying to use Straglr to call tandem repeats. After the program finished, I got the ins_merged.bed and tsv files. However, when I checked the log file, I found some error messages:

problem getting seq1 m64061_210713_112710/160105725/ccs ['chr19', 22927574, 22927624, 'AGCCCCCCGCCCGGCCAGCCGCCCCGTCCGGGAGGGAGGTGGGGGGGGTC,AGCCCCCCGCCCGGCCAGCCGCCCCGTCCGGGAGGGAGGTGGGGGGTC,AGCCCCCGCCCGGCCAGCCGCCCCGTCCGGGAGGGAGGTGGGGGGGTC,AGCCCCCGCCCGGCCAGCCGCCCCGTCCGGGAGGGAGGTGGGGGGTGTC,CCCCGTCCGGGAGGGAGGTGGGGGGGGTCAGCCCCCCGCCCGGCCAGCCG,CCGGCCAGCCGCCCCGTCCGGGAGGGAGGTGGGGGGTGTCAGCCCCCCTGC,GCCCGGCCAGCCGCCCCGTCCGGGAGGGAGGTGGGGGGGGTCAGCCCCC'] None 22927724 None
problem getting seq1 m64031_210322_010222/106758944/ccs ['chr19', 22927574, 22927624, 'AGCCCCCCGCCCGGCCAGCCGCCCCGTCCGGGAGGGAGGTGGGGGGGGTC,AGCCCCCCGCCCGGCCAGCCGCCCCGTCCGGGAGGGAGGTGGGGGGTC,AGCCCCCGCCCGGCCAGCCGCCCCGTCCGGGAGGGAGGTGGGGGGGTC,AGCCCCCGCCCGGCCAGCCGCCCCGTCCGGGAGGGAGGTGGGGGGTGTC,CCCCGTCCGGGAGGGAGGTGGGGGGGGTCAGCCCCCCGCCCGGCCAGCCG,CCGGCCAGCCGCCCCGTCCGGGAGGGAGGTGGGGGGTGTCAGCCCCCCTGC,GCCCGGCCAGCCGCCCCGTCCGGGAGGGAGGTGGGGGGGGTCAGCCCCC'] None 22927724 None

In another run that did not finish, the error messages are as follows:

trf input /tmp/tmpw7x9ozqo
can't generate temp file: /tmp/tmpz0dufyxc
gg 894074
ins all /tmp/tmpjl9mcqu3
gg 865674
trf input /tmp/tmpwcix8m7s
can't generate temp file: /tmp/tmp56_xzfhi
trf input /tmp/tmps9o2l3sh
can't generate temp file: /tmp/tmp08g0hvzk
trf input /tmp/tmp0h0uwas1
can't generate temp file: /tmp/tmpg0yd059k
trf input /tmp/tmpenawf5nt
can't generate temp file: /tmp/tmplzox77_x
trf input /tmp/tmp33jk6v2k
can't generate temp file: /tmp/tmppamp3_4z
trf input /tmp/tmp_n6qb_ks
can't generate temp file: /tmp/tmpkz_d23h3
trf input /tmp/tmpdor68mgx
can't generate temp file: /tmp/tmph1u5397w

I wonder whether these messages affect the accuracy and credibility of the resulting files, and I would like to know how to resolve them. Any suggestions? Thanks ever so much!
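To gauge how many reads and loci these warnings touch, the log can be tallied per locus. This is a minimal sketch, assuming the warning lines keep the format quoted above (`problem getting seq1 <read name> ['<chrom>', <start>, <end>, ...`). The helper name `tally_seq_warnings` is hypothetical and not part of Straglr.

```python
import re
from collections import Counter

# Pattern inferred from the log lines quoted in this thread (an assumption,
# not a documented Straglr format): read name, then chrom/start/end of the locus.
WARNING_RE = re.compile(
    r"^problem getting seq1 (?P<read>\S+) "
    r"\['(?P<chrom>[^']+)', (?P<start>\d+), (?P<end>\d+)"
)


def tally_seq_warnings(lines):
    """Count 'problem getting seq1' warnings per locus (chrom, start, end)."""
    per_locus = Counter()
    for line in lines:
        match = WARNING_RE.match(line)
        if match:
            locus = (match["chrom"], int(match["start"]), int(match["end"]))
            per_locus[locus] += 1
    return per_locus
```

Passing an open log file handle as `lines` gives a count of affected reads for each locus, which can then be compared with the read support reported for that locus in the output.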

LiShuhang-gif commented 2 years ago

I've tried the --reads_fasta option with a compressed fastq file, which leads to another error as follows:

zz 3103
multiprocess.pool.RemoteTraceback:
"""
Traceback (most recent call last):
  File "/public/home/fan_lab/shali/yes/lib/python3.7/site-packages/multiprocess/pool.py", line 121, in worker
    result = (True, func(*args, **kwds))
  File "/public/home/fan_lab/shali/yes/lib/python3.7/site-packages/multiprocess/pool.py", line 44, in mapstar
    return list(map(*args))
  File "/public/home/fan_lab/shali/yes/lib/python3.7/site-packages/pathos/helpers/mp_helper.py", line 15, in <lambda>
    func = lambda args: f(*args)
  File "/public/home/fan_lab/shali/yes/lib/python3.7/site-packages/src/ins.py", line 149, in examine_regions
    ins_list.extend(self.examine_region(region, bam=bam, reads_fasta=reads_fasta))
  File "/public/home/fan_lab/shali/yes/lib/python3.7/site-packages/src/ins.py", line 183, in examine_region
    ins_list.extend(self.extract_ins(aln, region, reads_fasta=reads_fasta))
  File "/public/home/fan_lab/shali/yes/lib/python3.7/site-packages/src/ins.py", line 360, in extract_ins
    ins[7] = INS.extract_neighbour_seqs(self.get_seq(reads_fasta, aln.query_name, aln.is_reverse), rpos, len(ins_seq), self.w)
TypeError: object of type 'NoneType' has no len()
"""

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/public/home/fan_lab/shali/yes/bin/straglr.py", line 77, in <module>
    main()
  File "/public/home/fan_lab/shali/yes/bin/straglr.py", line 62, in main
    ins = ins_finder.find_ins()
  File "/public/home/fan_lab/shali/yes/lib/python3.7/site-packages/src/ins.py", line 80, in find_ins
    batched_results = parallel_process(self.examine_regions, batches, self.nprocs)
  File "/public/home/fan_lab/shali/yes/lib/python3.7/site-packages/src/utils.py", line 20, in parallel_process
    results = p.map(func, args)
  File "/public/home/fan_lab/shali/yes/lib/python3.7/site-packages/pathos/multiprocessing.py", line 139, in map
    return _pool.map(star(f), zip(*args)) # chunksize
  File "/public/home/fan_lab/shali/yes/lib/python3.7/site-packages/multiprocess/pool.py", line 268, in map
    return self._map_async(func, iterable, mapstar, chunksize).get()
  File "/public/home/fan_lab/shali/yes/lib/python3.7/site-packages/multiprocess/pool.py", line 657, in get
    raise self._value
TypeError: object of type 'NoneType' has no len()

readmanchiu commented 2 years ago

Hi @LiShuhang-gif, thanks for trying Straglr. Would you mind re-cloning the current Straglr repo to see if you get any results? There is a new option, --tmpdir, that lets you specify where temporary files are written. From the error messages it seems your tmp space is used up. You need to find a location big enough for Straglr to generate its temporary files and pass it with --tmpdir. The --reads_fasta option has been tested: as long as the fastq sequences have been indexed by tabix, they should be accessible by pysam. They have to be bgzipped to be indexable by tabix.
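Before pointing --tmpdir at a directory, it can help to confirm that the directory is writable and has free space. This is a minimal sketch using only the Python standard library. The helper name `check_tmpdir` is hypothetical and not part of Straglr, and this thread does not state how much space counts as big enough.

```python
import shutil
import tempfile


def check_tmpdir(path):
    """Return (free_gib, writable) for a candidate temporary directory."""
    usage = shutil.disk_usage(path)  # named tuple: total, used, free (bytes)
    free_gib = usage.free / 1024 ** 3
    try:
        # Creating and auto-deleting a scratch file proves the directory is writable.
        with tempfile.NamedTemporaryFile(dir=path):
            writable = True
    except OSError:
        writable = False
    return free_gib, writable


# Example: inspect the default temporary directory of this machine.
free_gib, writable = check_tmpdir(tempfile.gettempdir())
print(f"{free_gib:.1f} GiB free, writable={writable}")
```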

LiShuhang-gif commented 2 years ago

Hello, thanks for your reply. As far as I know, though, tabix can't be used to index fastq files: there is no fastq preset for tabix -p, which is the option that specifies the file type.

tabix: option requires an argument -- 'p'

Program: tabix (TAB-delimited file InderXer)
Version: 0.2.5 (r1005)

Usage:   tabix <in.tab.bgz> [region1 [region2 [...]]]

Options: -p STR     preset: gff, bed, sam, vcf, psltbl [gff]

Can you show me how you handle fastq files, preferably with a specific Linux command line? Thank you very much!

readmanchiu commented 2 years ago

Sorry, it should be samtools, not tabix: use samtools faidx or samtools fqidx.
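Since the reads file has to be bgzipped before it can be indexed, it may be worth checking whether an existing .gz file is BGZF (what bgzip writes) or plain gzip. This is a minimal sketch based on the BGZF header layout, which is a gzip header carrying a `BC` extra subfield. The helper name `is_bgzf` is hypothetical.

```python
import struct


def is_bgzf(path):
    """Return True if the file starts with a BGZF block (the format bgzip writes)."""
    with open(path, "rb") as handle:
        header = handle.read(12)
        # gzip magic (1f 8b), deflate method (08), and the FEXTRA flag bit (0x04).
        if len(header) < 12 or header[:3] != b"\x1f\x8b\x08" or not header[3] & 0x04:
            return False
        (xlen,) = struct.unpack("<H", header[10:12])
        extra = handle.read(xlen)
    # Walk the extra subfields looking for the 'BC' subfield that marks BGZF.
    pos = 0
    while pos + 4 <= len(extra):
        subfield_id = extra[pos:pos + 2]
        (slen,) = struct.unpack("<H", extra[pos + 2:pos + 4])
        if subfield_id == b"BC" and slen == 2:
            return True
        pos += 4 + slen
    return False

# If this returns False for a .gz file, the file was probably written by plain
# gzip and would need recompressing before indexing, for example:
#   gunzip -c reads.fq.gz | bgzip -c > reads.bgz.fq.gz
#   samtools fqidx reads.bgz.fq.gz
# (file names are placeholders; assumes bgzip and samtools are on PATH)
```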

LiShuhang-gif commented 2 years ago

Hello, I have now specified the path for tmp files with the --tmpdir option, compressed my fastq files with bgzip, and indexed them with samtools fqidx. However, a new error message has appeared:

Traceback (most recent call last):
  File "/public/home/fan_lab/shali/yes/bin/straglr.py", line 4, in <module>
    __import__('pkg_resources').run_script('straglr==1.2.0', 'straglr.py')
  File "/public/home/fan_lab/shali/yes/lib/python3.7/site-packages/pkg_resources/__init__.py", line 651, in run_script
    self.require(requires)[0].run_script(script_name, ns)
  File "/public/home/fan_lab/shali/yes/lib/python3.7/site-packages/pkg_resources/__init__.py", line 1455, in run_script
    exec(script_code, namespace, namespace)
  File "/public/home/fan_lab/shali/yes/lib/python3.7/site-packages/straglr-1.2.0-py3.7.egg/EGG-INFO/scripts/straglr.py", line 80, in <module>
  File "/public/home/fan_lab/shali/yes/lib/python3.7/site-packages/straglr-1.2.0-py3.7.egg/EGG-INFO/scripts/straglr.py", line 50, in main
TypeError: __init__() got an unexpected keyword argument 'min_cluster_size'

The script I used is as follows:

straglr.py C1.sort.filter.bam hg38_22_XYM.fa straglr_scan_min_ins20.tsv \
  --min_ins_size 20 \
  --genotype_in_size \
  --min_support 2 \
  --nprocs 16 \
  --tmpdir /public/home/fan_lab/shali/VNTR/Straglr/C1_ins20/tmp \
  --reads_fasta ../combined_C1.fq.gz

According to the error message, something seems to be wrong with min_cluster_size, but I didn't set this parameter in my script. Any suggestions for solving this error? I'll try anything you suggest right away. Thanks again!

readmanchiu commented 2 years ago

It seems you are not running the latest code (in the src directory): min_cluster_size is a newly added parameter, and the error message says it is not recognized. Why don't you try running the small test data set I've put in the test directory and see if you get the expected output (genome_scan.*)? The command is simply:

straglr.py test.bam /your/path/to/hg38.fa your_output_prefix

Note that 2 files will be generated: a bed file without the read names and details, and the old tsv file. So the third parameter to Straglr should be the output prefix, without the tsv extension.
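An "unexpected keyword argument" error of this kind typically means the calling script and the imported module come from different versions. This is a minimal sketch of how to check whether a constructor accepts a given keyword. `OldFinder` and `NewFinder` are hypothetical stand-ins, not Straglr's real classes, and the default value shown is invented for illustration.

```python
import inspect


def accepts_keyword(func, name):
    """Return True if `func` can be called with the keyword argument `name`."""
    params = inspect.signature(func).parameters
    named = params.get(name)
    if named is not None:
        return named.kind in (
            inspect.Parameter.POSITIONAL_OR_KEYWORD,
            inspect.Parameter.KEYWORD_ONLY,
        )
    # A **kwargs catch-all accepts any keyword.
    return any(p.kind is inspect.Parameter.VAR_KEYWORD for p in params.values())


# Hypothetical stand-ins for an older and a newer version of the same class.
class OldFinder:
    def __init__(self, bam, genome_fasta):
        self.bam, self.genome_fasta = bam, genome_fasta


class NewFinder:
    def __init__(self, bam, genome_fasta, min_cluster_size=2):  # default is illustrative
        self.bam, self.genome_fasta = bam, genome_fasta
        self.min_cluster_size = min_cluster_size


print(accepts_keyword(OldFinder, "min_cluster_size"))
print(accepts_keyword(NewFinder, "min_cluster_size"))
```

Printing the `__file__` attribute of the imported module shows which installed copy Python actually picks up, which helps when an old and a new install coexist.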

LiShuhang-gif commented 2 years ago

Hi, thanks for your prompt reply! Can you tell me how to update Straglr to the latest version? The following command does not seem to work on my server:

pip install git+https://github.com/bcgsc/straglr.git#egg=straglr

When I run this command, I get some error messages:

(base) [shali@vm-login01 biosoft]$ pip install git+https://github.com/bcgsc/straglr.git#egg=straglr
Collecting straglr
  Cloning https://github.com/bcgsc/straglr.git to /tmp/pip-install-4ctnhlfw/straglr_9e151966f37b49ff99dd3e0885f5c427
  Running command git clone -q https://github.com/bcgsc/straglr.git /tmp/pip-install-4ctnhlfw/straglr_9e151966f37b49ff99dd3e0885f5c427
  fatal: unable to access 'https://github.com/bcgsc/straglr.git/': OpenSSL SSL_connect: SSL_ERROR_SYSCALL in connection to github.com:443
WARNING: Discarding git+https://github.com/bcgsc/straglr.git#egg=straglr. Command errored out with exit status 128: git clone -q https://github.com/bcgsc/straglr.git /tmp/pip-install-4ctnhlfw/straglr_9e151966f37b49ff99dd3e0885f5c427 Check the logs for full command output.
ERROR: Could not find a version that satisfies the requirement straglr (unavailable)
ERROR: No matching distribution found for straglr (unavailable)

So I tried downloading the ZIP and unzipping it. Then I ran the following command:

python setup.py build
python setup.py install

The installation seemed to succeed, but according to what you said this is not the latest version, and I would like to know how to update Straglr to the latest version. Thanks again!

LiShuhang-gif commented 2 years ago

My version of Straglr is 1.2.0, which is also shown in the error message. The same error occurred when I ran the test data.

(base) [shali@vm-login02 test]$ straglr.py test.bam /public/home/fan_lab/shali/reference/hg38_22_XYM.fa ./bam/try
Traceback (most recent call last):
  File "/public/home/fan_lab/shali/yes/bin/straglr.py", line 4, in <module>
    __import__('pkg_resources').run_script('straglr==1.2.0', 'straglr.py')
  File "/public/home/fan_lab/shali/yes/lib/python3.7/site-packages/pkg_resources/__init__.py", line 651, in run_script
    self.require(requires)[0].run_script(script_name, ns)
  File "/public/home/fan_lab/shali/yes/lib/python3.7/site-packages/pkg_resources/__init__.py", line 1455, in run_script
    exec(script_code, namespace, namespace)
  File "/public/home/fan_lab/shali/yes/lib/python3.7/site-packages/straglr-1.2.0-py3.7.egg/EGG-INFO/scripts/straglr.py", line 80, in <module>
  File "/public/home/fan_lab/shali/yes/lib/python3.7/site-packages/straglr-1.2.0-py3.7.egg/EGG-INFO/scripts/straglr.py", line 50, in main
TypeError: __init__() got an unexpected keyword argument 'min_cluster_size'

I don't know whether my installation method, python setup.py build and python setup.py install, led to this error.
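The traceback above loads the script from a `straglr-1.2.0-py3.7.egg` directory, so an older copy may still be present in site-packages. This is a minimal sketch for listing installed copies by name. The helper `find_installed_copies` is hypothetical, and matching on a name prefix is an assumption about how the package files are named.

```python
import site
from pathlib import Path


def find_installed_copies(name, search_dirs=None):
    """List entries in the given site-packages directories that start with `name`."""
    if search_dirs is None:
        # Default to the interpreter's global and per-user site-packages.
        search_dirs = site.getsitepackages() + [site.getusersitepackages()]
    hits = []
    for directory in search_dirs:
        root = Path(directory)
        if root.is_dir():
            hits.extend(sorted(root.glob(f"{name}*")))
    return hits
```

Calling `find_installed_copies("straglr")` with the same Python interpreter that runs straglr.py lists every matching egg, directory, or metadata entry, so a leftover older copy can be spotted and removed before reinstalling.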

LiShuhang-gif commented 2 years ago

After changing HTTPS to HTTP, I successfully installed Straglr. I ran the data in the test directory and got the same results as in genome_scan.tsv. Currently, I'm trying to run Straglr on my own data, using both the --reads_fasta and --tmpdir parameters. Thanks again! It was very helpful to me. I'll contact you if I have any questions.

readmanchiu commented 2 years ago

Glad to hear that it's working now, at least for the test data :) Let me know if there are any issues. BTW, --min_ins 50 may be a bit too low. I'm not sure whether you are analyzing PacBio CCS/HiFi or Nanopore reads, but Nanopore reads are noisier, so you may get many 50bp insertions that are not real.

LiShuhang-gif commented 2 years ago

Yeah, but the results turned out to be a little different than I expected. With the --reads_fasta and --tmpdir options, Straglr finds fewer tandem repeats (I thought it would find more with the fastq file). I got 7,821 tandem repeats without the fastq files, but only 7,316 when providing them. And the log file is empty. Is this normal? Why did Straglr end up finding fewer tandem repeats with the fastq file? Was it validating the tandem repeats it found more rigorously? Thanks!

readmanchiu commented 2 years ago

There should be only 13 reads that carry an expansion at the ATXN10 locus in the test data when running Straglr with the default parameters. Did you run with different parameters? The current version does not output any messages to stdout unless you run with --debug. Note that when --debug is turned on, the temporary files are not removed; you will have to remove them manually.
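Cleaning up after a --debug run can be scripted. This is a minimal sketch, assuming the leftover files sit directly in the directory given to --tmpdir and keep the `tmp` name prefix seen in the logs in this thread. The helper `clean_tmpdir` is hypothetical, defaults to a dry run, and should not be used while Straglr is still running in that directory.

```python
from pathlib import Path


def clean_tmpdir(tmpdir, prefix="tmp", dry_run=True):
    """Return leftover files named `prefix*` in `tmpdir`; delete them unless dry_run."""
    leftovers = sorted(p for p in Path(tmpdir).glob(f"{prefix}*") if p.is_file())
    if not dry_run:
        for path in leftovers:
            path.unlink()  # remove the file; directories are never touched
    return leftovers
```

Running it once with the default `dry_run=True` and reviewing the returned list before calling it again with `dry_run=False` avoids deleting anything unexpected.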

LiShuhang-gif commented 2 years ago

Hi, the results I mentioned above are based on my own data, not the test data. I thought providing the fastq file would enable Straglr to find more tandem repeat loci, but in reality there are fewer, which really confused me. By the way, given that the current version does not output any messages to stdout, does that mean I will not know whether an error occurred unless I add the --debug option? Thanks!

readmanchiu commented 2 years ago

Yes, you have to turn on --debug to see the warning messages. The numbers you reported are the numbers of loci, not the numbers of reads, right? There is a potential glitch in using the read sequences that I just thought of; I'll need to check it. So I suggest taking the bam file results as they are for now.
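To separate locus counts from row counts and to see which loci differ between two runs, the two tsv outputs can be compared directly. This is a minimal sketch, assuming the first three tab-separated columns are chromosome, start, and end, and that header lines start with `#`. The helpers `load_loci` and `compare_loci` are hypothetical.

```python
import csv


def load_loci(tsv_path):
    """Return (row_count, set of (chrom, start, end)) from a tab-separated file."""
    rows = 0
    loci = set()
    with open(tsv_path, newline="") as handle:
        for fields in csv.reader(handle, delimiter="\t"):
            if not fields or fields[0].startswith("#"):
                continue  # skip blank lines and header/comment lines
            rows += 1
            loci.add((fields[0], int(fields[1]), int(fields[2])))
    return rows, loci


def compare_loci(path_a, path_b):
    """Return loci found only in A, only in B, and in both files."""
    _, loci_a = load_loci(path_a)
    _, loci_b = load_loci(path_b)
    return loci_a - loci_b, loci_b - loci_a, loci_a & loci_b
```

Comparing the run without --reads_fasta against the run with it shows whether the second run is a strict subset of the first or whether each run reports loci the other lacks.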

LiShuhang-gif commented 2 years ago

Yes, the numbers I reported are the numbers of loci, not the numbers of reads. Unfortunately, with the debug option, the error message appears again.

trf input /public/home/fan_lab/shali/VNTR/Straglr/C1_ins20_fasta_debug/tmp/tmpdajybv79
problem getting seq1 m64030_210322_005835/42141537/ccs ['chr16', 34584353, 34584354, 'AATGGAATCATCATCGAATGGAATCG,ATCGAATGGACTCGAATGGAATCATCATCGAATGGAATCGAATGGAATC,CGAATGGAAACATCATCAATGGAAT,CGAATGGAATCATCATCGAATGGAAT,CGAATGGAATCGAATGGAATCACAT'] None None None
problem getting seq1 m64031_210323_071702/119540563/ccs ['chr16', 34584353, 34584354, 'AATGGAATCATCATCGAATGGAATCG,ATCGAATGGACTCGAATGGAATCATCATCGAATGGAATCGAATGGAATC,CGAATGGAAACATCATCAATGGAAT,CGAATGGAATCATCATCGAATGGAAT,CGAATGGAATCGAATGGAATCACAT'] None None None

Given that this seems to be a warning message rather than an error message, I wonder if it might adversely affect the results? Thanks!

readmanchiu commented 2 years ago

Yes, these messages come up whenever the read coordinates that the script computes do not lead to successful extraction of the subsequence, usually as a result of split alignments.
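The failure mode described here, where a missing sequence or invalid coordinates yield `None` and a later `len()` call fails as in the `TypeError` traceback earlier in this thread, can be illustrated with a guarded extraction. This is a hypothetical sketch, not Straglr's actual code; `safe_flanks` and its 0-based coordinate convention are assumptions made for the example.

```python
def safe_flanks(read_seq, pos, ins_len, width):
    """Return (left, right) flanks around an insertion, or None if not extractable.

    `pos` is the 0-based start of the insertion within the read (an assumption
    of this sketch) and `width` is the flank length on each side.
    """
    if read_seq is None or pos is None or ins_len is None:
        return None  # sequence or coordinates could not be determined
    end = pos + ins_len
    if pos < 0 or ins_len < 0 or end > len(read_seq):
        return None  # coordinates fall outside the read
    left = read_seq[max(0, pos - width):pos]
    right = read_seq[end:end + width]
    return left, right


print(safe_flanks("ACGTACGTAC", 4, 2, 3))  # ('CGT', 'GTA')
print(safe_flanks(None, 4, 2, 3))          # None
```

A caller that checks for `None` can skip the affected read and log a warning instead of raising, which matches the behaviour the messages above suggest.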
