bcgsc / tigmint

⛓ Correct misassemblies using linked AND long reads
https://bcgsc.github.io/tigmint/
GNU General Public License v3.0

tigmint_molecule_paf.py: TypeError: expected string or bytes-like object #68

Closed: mmokrejs closed this issue 2 years ago

mmokrejs commented 2 years ago

Does tigmint-make tigmint-long support FASTQ reads from platforms other than 10x Genomics Chromium?

$ bash -x tigmint.sh SCRATCH=/scratch/mmokrejs/job_3024675.cerit-pbs.cerit-sc.cz TMPDIR=/scratch/mmokrejs/job_3024675.cerit-pbs.cerit-sc.cz SORT_OPTS='-S 1G'
+ '[' -z '' ']'
+ threads=14
+ myreads=foo_PacBio_and_Nanopore.fq.gz
+ for f in foo__abyss_*long-scaffs.fa
++ basename foo__abyss_106-long-scaffs.fa .fa
+ p=foo__abyss_106-long-scaffs
+ echo 'tigmint-make tigmint-long draft=foo__abyss_106-long-scaffs.fa reads=foo_PacBio_and_Nanopore.fmlrc2.fa.gz span=auto G=6.8e9 dist=auto'
tigmint-make tigmint-long draft=foo__abyss_106-long-scaffs.fa reads=foo_PacBio_and_Nanopore.fmlrc2.fa.gz span=auto G=6.8e9 dist=auto
++ basename foo__abyss_106-long-scaffs.fa .fa
++ basename foo_PacBio_and_Nanopore.fq.gz .fq.gz
+ tigmint-make tigmint-long draft=foo__abyss_106-long-scaffs reads=foo_PacBio_and_Nanopore longmap=ont span=auto G=6.8e9 dist=auto t=14
long-to-linked-pe -l 500 -m2000 -g6.8e9 -s -b foo_PacBio_and_Nanopore.barcode-multiplicity.tsv --bx -t14 --fasta -f foo_PacBio_and_Nanopore.tigmint-long.params.tsv foo_PacBio_and_Nanopore.fq.gz | \
minimap2 -y -t14 -x map-ont --secondary=no foo__abyss_106-long-scaffs.fa - | \
tigmint_molecule_paf.py -q0 -s2000 -p foo_PacBio_and_Nanopore.tigmint-long.params.tsv - | sort -k1,1 -k2,2n -k3,3n -T /scratch/mmokrejs/job_3024675.cerit-pbs.cerit-sc.cz -S 1G > foo__abyss_106-long-scaffs.foo_PacBio_and_Nanopore.cut500.molecule.size2000.bed
long-to-linked-pe v1.0: Using more than 6 threads does not scale, reverting to 6.
[M::mm_idx_gen::84.761*1.60] collected minimizers
[M::mm_idx_gen::89.593*2.24] sorted minimizers
[M::main::89.593*2.24] loaded/built the index for 3530326 target sequence(s)
[M::mm_mapopt_update::91.203*2.22] mid_occ = 1442
[M::mm_idx_stat] kmer size: 15; skip: 10; is_hpc: 0; #seq: 3530326
[M::mm_idx_stat::92.061*2.21] distinct minimizers: 67761996 (20.86% are singletons); average occurrences: 10.813; average spacing: 5.460
Traceback (most recent call last):
  File "/usr/lib/python-exec/python3.9/tigmint_molecule_paf.py", line 141, in <module>
    main()
  File "/usr/lib/python-exec/python3.9/tigmint_molecule_paf.py", line 138, in main
    MolecIdentifierPaf().run()
  File "/usr/lib/python-exec/python3.9/tigmint_molecule_paf.py", line 98, in run
    self.print_new_molecule(prev_barcode, cur_intervals, out_molecules_file)
  File "/usr/lib/python-exec/python3.9/tigmint_molecule_paf.py", line 43, in print_new_molecule
    barcode_match = re.search(r'^BX:Z:(\S+)', barcode)
  File "/usr/lib/python3.9/re.py", line 201, in search
    return _compile(pattern, flags).search(string)
TypeError: expected string or bytes-like object
^Cmake: *** Deleting file `foo__abyss_106-long-scaffs.foo_PacBio_and_Nanopore.cut500.molecule.size2000.bed'
time user=64.51s system=53.79s elapsed=565.02s cpu=20% memory=4 job=long-to-linked-pe -l 500 -m2000 -g6.8e9 -s -b  --bx -t14 --fasta -f
time user=807.08s system=66.89s elapsed=151.51s cpu=576% memory=40947 job=minimap2 -y -t14 -x map-ont --secondary=no  -
time user=0.06s system=0.05s elapsed=151.61s cpu=0% memory=12 job=tigmint_molecule_paf.py -q0 -s2000 -p  -
time user=0.00s system=0.00s elapsed=151.61s cpu=0% memory=0 job=
make: *** [foo__abyss_106-long-scaffs.foo_PacBio_and_Nanopore.cut500.molecule.size2000.bed] Interrupt
lcoombe commented 2 years ago

Hi @mmokrejs,

I just want to make sure the distinction between the original tigmint and tigmint-long is clear: tigmint uses linked reads for the correction, whereas tigmint-long uses long reads. tigmint-long supports FASTA or FASTQ reads from any long-read technology. For tigmint-long, we simulate 'pseudo-linked reads' from the long reads: we write an output FASTA file where the 'barcode' is in the read header, and every linked read derived from the same long read is given the same barcode. The error you have here looks like it comes from tigmint-long, where we prescribe the format ourselves, since we write the file that is aligned with minimap2.

If you're still seeing this error, feel free to post the full command and log.

Thanks, Lauren
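
To illustrate the pseudo-linked-read idea described above, here is a minimal sketch in Python. It is not the actual long-to-linked-pe implementation: the function name, the header layout, and the fixed-length cutting are assumptions made for illustration only. The one property it is meant to show is that every segment cut from the same long read carries the same BX:Z: barcode.

```python
from typing import Iterator, Tuple


def pseudo_linked_reads(read_id: str, seq: str, barcode: str,
                        cut: int = 500) -> Iterator[Tuple[str, str]]:
    """Yield (header, sequence) FASTA records cut from one long read.

    Hypothetical layout: each segment gets its own name plus a shared
    BX:Z: comment, so segments from one long read share one barcode.
    """
    for i, start in enumerate(range(0, len(seq), cut)):
        segment = seq[start:start + cut]
        yield f">{read_id}_{i} BX:Z:{barcode}", segment


if __name__ == "__main__":
    # A made-up 1200-base read, cut into segments of at most 500 bases.
    for header, segment in pseudo_linked_reads("read1", "ACGT" * 300, "7"):
        print(header, len(segment))
```

Downstream steps can then group alignments by barcode, much as they would for real linked reads.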

mmokrejs commented 2 years ago

Hi @lcoombe, thank you for confirming that tigmint should work with any long reads (notably PacBio and Nanopore).

Here are my read names from PacBio HiFi:

@m54312U_210604_060420/36/ccs
@m54312U_210604_060420/1836835/ccs
@m54312U_210604_060420/3671337/ccs
@m54312U_210604_060420/5572759/ccs
@m54312U_210604_060420/7603272/ccs
@m54312U_210604_060420/9570710/ccs
@m54312U_210604_060420/11601384/ccs
@m54312U_210604_060420/13632255/ccs
@m54312U_210604_060420/15664129/ccs
@m54312U_210604_060420/17695940/ccs
...

And here are the reads from PromethION:

...
@3449a3f3-3517-4e6c-a4a9-556181a35c0f runid=f449620d42d064e00e812250f447e112b6cda908 read=23410 ch=2579 start_time=2020-07-03T18:43:27Z flow_cell_id=PAE36794 protocol_group_id=200630Wen sample_id=D20-2216
@02e23ba5-b6db-4a2a-afbf-ee032d2b216d runid=f449620d42d064e00e812250f447e112b6cda908 read=24449 ch=2169 start_time=2020-07-03T18:43:54Z flow_cell_id=PAE36794 protocol_group_id=200630Wen sample_id=D20-2216

My command was

tigmint-make tigmint-long draft=`basename "$f" .fa` reads=`basename $myreads .fq.gz` longmap=ont span="auto" G="6.8e9" dist="auto" t=$threads
lcoombe commented 2 years ago

Hi @mmokrejs,

Just to confirm: is your myreads.fq.gz a mix of ONT (FASTA) and PacBio (FASTQ) reads?

Do you see any other messages before the Python TypeError? That TypeError usually indicates an error or warning that happened upstream.

mmokrejs commented 2 years ago

Sorry, I was in a hurry and picked a FASTA derivative of the file. I confirm that the FASTQ file complies with the FASTQ format. I edited the read names above to show the reads at the end of the huge file from PromethION.

No, I haven't seen other error messages before that, but I do remember seeing this reported by some other user as well, either under tigmint or arcs or a similar bcgsc project. Just poke through the other issues on GitHub.

lcoombe commented 2 years ago

Could you please post the full log? It's helpful for me to have the full log to better aid you in troubleshooting. I took a look through the issues on tigmint and arcs, but couldn't find a similar issue - if you recall where you saw that and link it here that would be helpful! As I mentioned, I have seen this error before, but generally after an upstream failure, so that's just why I'd like to have a second glance at your full log.

mmokrejs commented 2 years ago

OK, luckily I found it in the xterm buffer, so I edited the original post. Yeah, I will try to find the similar issue on GitHub; sorry that I did not keep the link to it handy.

lcoombe commented 2 years ago

Thanks for the full log!

This is a funny one - I'm not too sure if the issue is happening in the long-to-linked-pe stage or at the tigmint_molecule_paf.py stage. Just checking - have you tried running your installation on our installation test files? (https://github.com/bcgsc/tigmint/tree/master/tests/test_installation) It's just a good sanity check for us to see if it is an installation/environment issue or something with your particular data.

mmokrejs commented 2 years ago

Even after briefly poking through the xterm buffer, I don't think I ran the tigmint tests.

But because this Python error mentions BX:Z tags, I moved to ntLink, and its tests passed successfully.

You know, based on this crash I thought tigmint actually requires the BX:Z tags from 10x Chromium, even in tigmint-long mode, and that I had just somehow misread the README.md. ntLink seemed to be the way to go for me.

mmokrejs commented 2 years ago

After the crash something was still occupying the shell, hence the Ctrl+C to quit it. Maybe it was some parallel job started by tigmint-make. It seemed obvious that I had to overcome the Python crash anyway. It is probably a crash on an empty or None value, which would not be surprising as I do not have 10x Chromium data.
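
The "empty or None" guess can be narrowed down with a small standalone check. This sketch only exercises the re.search call shown in the traceback; the helper name try_match is made up for illustration, and nothing here is taken from tigmint_molecule_paf.py beyond the pattern itself.

```python
import re

BX_PATTERN = r'^BX:Z:(\S+)'  # pattern copied from the traceback above


def try_match(barcode):
    """Return the barcode value, None if nothing matched, or 'TypeError'."""
    try:
        match = re.search(BX_PATTERN, barcode)
    except TypeError:
        # re.search rejects anything that is not str or bytes-like
        return "TypeError"
    return match.group(1) if match else None


if __name__ == "__main__":
    print(try_match("BX:Z:42"))  # a well-formed tag
    print(try_match(""))         # an empty string: no match, but no exception
    print(try_match(None))       # a missing value: re.search raises TypeError
```

An empty string simply fails to match, so the TypeError in the log points to a value that was not a string at all (for example None), not to an empty barcode.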

lcoombe commented 2 years ago

Sure, if you want to continue troubleshooting the TypeError, let me know how running the tests goes when you get around to it.

And just to clarify (also for the benefit of others), tigmint-long does use BX tags, but only under the hood, so you wouldn't notice that without the error. While the input of tigmint-long is normal long reads, it simulates 'pseudo-linked reads' from the long reads using long-to-linked-pe, so it generates the reads with the BX tag in the header. So looking at the entire command of that step is helpful in understanding what's going on.

I'm glad you've found ntLink, and (as you may have noticed), Tigmint-long + ntLink is in fact the default mode of our longstitch pipeline.
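
As a rough illustration of the "under the hood" BX tag handling described above, the sketch below pulls a BX:Z: value out of one PAF line. It assumes the barcode arrives as a tab-separated field after the 12 mandatory PAF columns, which is where minimap2 places FASTA/FASTQ comments copied with its -y option. The function name and the sample line are hypothetical, and this is not the parsing code used by tigmint_molecule_paf.py.

```python
from typing import Optional


def barcode_from_paf(line: str) -> Optional[str]:
    """Return the BX:Z: value from a PAF line, or None if the tag is absent."""
    fields = line.rstrip("\n").split("\t")
    for field in fields[12:]:  # the first 12 PAF columns are mandatory
        if field.startswith("BX:Z:"):
            return field[len("BX:Z:"):]
    return None


if __name__ == "__main__":
    # Made-up alignment record: 12 mandatory columns, then optional tags.
    paf = "\t".join([
        "read1_0", "500", "0", "500", "+", "scaffold1", "10000",
        "100", "600", "480", "500", "60", "tp:A:P", "BX:Z:7",
    ])
    print(barcode_from_paf(paf))
```

If the reads reaching the aligner carry no BX comment, a lookup like this returns None, and passing None unchecked into a regex search would fail with a TypeError of the kind shown in the log. Inspecting the first few lines of the long-to-linked-pe output for BX:Z: in the headers is therefore a quick way to see which stage went wrong.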

github-actions[bot] commented 2 years ago

This issue has been automatically marked as stale because it has not had any recent activity. It will be closed if no further activity occurs. Thank you for your interest in Tigmint!