EI-CoreBioinformatics / mikado

Mikado is a lightweight Python3 pipeline whose purpose is to facilitate the identification of expressed loci from RNA-Seq data and to select the best models in each locus.
https://mikado.readthedocs.io/en/stable/
GNU Lesser General Public License v3.0

serialise load BLAST data error (Cannot use a compiled regex as replacement pattern with regex=False) #457

Open jolbi opened 3 months ago

jolbi commented 3 months ago

Hi,

When running mikado serialise with BLAST TSV results, I get the error: Cannot use a compiled regex as replacement pattern with regex=False

Serialise log:

2024-06-13 09:30:05,071 - serialise - serialise.py:321 - INFO - setup - MainProcess - Mikado version: 2.3.4
2024-06-13 09:30:05,071 - serialise - serialise.py:322 - INFO - setup - MainProcess - Command line: /users/timg/.conda/envs/mikado/bin/mikado serialise --json-conf /scratch/timg/desiree_annotation/mikado/De_v1_hap3_chrs/v4/configuration_v4.yaml --tsv /scratch/timg/desiree_annotation/mikado/De_v1_hap3_chrs/v4/mikado_prepared.blast.tsv --orfs /scratch/timg/desiree_annotation/mikado/De_v1_hap3_chrs/v4/mikado_prepared.fasta.transdecoder.bed --blast-loading-debug
2024-06-13 09:30:05,084 - serialise - serialise.py:332 - INFO - setup - MainProcess - Random seed: 0
2024-06-13 09:30:05,084 - serialise - serialise.py:345 - INFO - setup - MainProcess - Using a sqlite database (location: /scratch/timg/desiree_annotation/mikado/De_v1_hap3_chrs/v4/mikado.db)
2024-06-13 09:30:05,084 - serialise - serialise.py:348 - INFO - setup - MainProcess - Requested 1 threads, forcing single thread: False
2024-06-13 09:30:05,085 - serialise - serialise.py:176 - INFO - load_orfs - MainProcess - Starting to load ORF data
2024-06-13 09:30:30,488 - serialise - orf.py:351 - INFO - __serialize_single_thread - MainProcess - Finished loading 369856 ORFs into the database
2024-06-13 09:30:34,336 - serialise - serialise.py:187 - INFO - load_orfs - MainProcess - Finished loading ORF data
2024-06-13 09:30:34,435 - serialise - serialise.py:142 - INFO - load_blast - MainProcess - Starting to load BLAST data
2024-06-13 09:30:34,435 - serialise - blast_serialiser.py:82 - INFO - __init__ - MainProcess - Number of dedicated workers: 1
2024-06-13 09:30:34,441 - serialise - blast_serialiser.py:106 - WARNING - __init__ - MainProcess - Activating the XML debug mode
2024-06-13 09:30:45,733 - serialise - blast_serialiser.py:249 - INFO - __serialize_targets - MainProcess - Started to serialise the targets
2024-06-13 09:30:45,975 - serialise - blast_serialiser.py:283 - INFO - __serialize_targets - MainProcess - Loaded 41712 objects into the "target" table
2024-06-13 09:30:46,075 - serialise - blast_serialiser.py:174 - INFO - __serialize_queries - MainProcess - Started to serialise the queries
2024-06-13 09:30:46,492 - serialise - blast_serialiser.py:226 - INFO - __serialize_queries - MainProcess - Loaded 0 objects into the "query" table
2024-06-13 09:30:47,568 - serialise - blast_serialiser.py:233 - INFO - __serialize_queries - MainProcess - 450524 in queries
2024-06-13 09:30:47,598 - serialise - tab_serialiser.py:31 - INFO - _serialise_tabular - MainProcess - Creating a pool with 1 workers for analysing BLAST results
2024-06-13 09:30:48,538 - serialise - tabular_utils.py:431 - INFO - parse_tab_blast - MainProcess - Reading /scratch/timg/desiree_annotation/mikado/De_v1_hap3_chrs/v4/mikado_prepared.blast.tsv data
2024-06-13 09:30:57,391 - serialise - serialise.py:388 - ERROR - serialise - MainProcess - Mikado crashed due to an error. Please check the logs for hints on the cause of the error; if it is a bug, please report it to https://github.com/EI-CoreBioinformatics/mikado/issues.
2024-06-13 09:30:57,392 - serialise - serialise.py:390 - ERROR - serialise - MainProcess - Cannot use a compiled regex as replacement pattern with regex=False

Command used:

mikado serialise \
--json-conf $out_dir/configuration_$version.yaml \
--tsv $out_dir/mikado_prepared.blast.tsv \
--orfs $out_dir/mikado_prepared.fasta.transdecoder.bed \
--blast-loading-debug

First 20 lines of $out_dir/mikado_prepared.blast.tsv:

bacillus_STRG-bacillus.1.1      sp|Q8RWX4|KOC1_ARATH    35.294  34      22      0       102     1       300     333     1.0     30.4    61.76   ILVL1FLSKQS1RQKNGSMERE2LVER4ESER1VI1EKASVSVASTDL1FY1
bacillus_STRG-bacillus.1.1      sp|Q1EHT7|C3H4_ORYSJ    36.000  25      16      0       261     187     879     903     2.6     29.3    72.00   1STQKIL2IVLF1LFKA1GIND2TENSLIFSHERQ2IL
bacillus_STRG-bacillus.1.1      sp|O65451|FB333_ARATH   44.737  38      17      1       120     19      175     212     4.4     28.5    63.16   2VQKRGN1ITVRVL1SK3KN-R-I-E-MGVML2QKLS1RD2DS2ED1RK2VI
bacillus_STRG-bacillus.1.1      sp|Q94EJ6|PMTE_ARATH    20.930  86      66      1       588     331     354     437     6.3     28.1    46.51   1IF1QKHKYIDNSDKR1GCGDPR1VT1SVLDET1*-S-IKQR1GDRTED1TV1WYNKGEYI1VTVCAV2SFFPRKLVASSNKELEKEMVLAKGRGNKMLRK1WFSPKEERML1GAIVVPAPVSKIMSRKWGTL1KNNG1EDGEMENSAYLQ2
bacillus_STRG-bacillus.1.1      sp|Q9FIY7|SM3L3_ARATH   33.333  36      24      0       129     22      554     589     6.4     28.1    66.67   KE1EDEYEDVEKRGA1ILVLVAFISAQKLR3MCRQEQQS2RQGNIK4VARP1AS
cpb_STRG-cpb.320.1      sp|Q8VXX4|RFC3_ARATH    85.606  132     19      0       176     571     1       132     3.56e-82        248     96.21   3IV6TS7QE1VI2NK1RK5GQ23RK2FY1PA2DE6KRIA6TS4VL8HN4NT10VI21FY1
cpb_STRG-cpb.320.1      sp|Q852K3|RFC5_ORYSJ    78.030  132     29      0       176     571     1       132     2.43e-76        234     89.39   3IV11IT2QDDQ5RK3SA1GQ18IV3LIRK1IM2PASG5VM3IT2VI1AT1TS1TN1DEVI2TATM3TA4LM15IV11TA3KRGA2
cpb_STRG-cpb.320.1      sp|Q9CAQ8|RFC5_ARATH    34.091  132     63      4       182     520     41      167     7.93e-17        78.2    53.03   1IVDE4KQTS2KD1IAVA1QR1VIAIQDNTLIRDKR1VTSN1GNDKCL4FL3SP1ST2KTTSLT1ML1LVLA1QKILFY2S-A-D-K-V-1VYER1KM1WLKEVLDN1GS-DTD1TG3-V-R-QEQLITQTDLFSA2HQHSVFES1NGPK1-S-V-K-L-V-L-L-D-E-A-D-A-M-T-K2GQ1QADL1YR1VIQEEKIYIT1
cpb_STRG-cpb.320.1      sp|Q93ZX1|RFC4_ARATH    43.939  66      37      0       182     379     11      76      1.24e-14        71.6    63.64   1IVDE5TQLVDKKD1IAVHHQQEDE1AVQRNV1RTKNLTVLSQETGA4LM5SP1ST2KT1LTIAML1LILARH1IL3SEALDY1VSKR1
cpb_STRG-cpb.320.1      sp|Q6YZ54|RFC3_ORYSJ    29.605  152     72      4       182     619     40      162     6.14e-14        69.7    48.68   1IVDE4KQTS1DGKD1IAVA1QR1VIAVQDNTLIRDKR1VTSN1GNDRCL4FL3SP1ST2KTTSLT1ML1LVLA1QKILFY1PSS-A-D-K-V-KQVYEG1KM1W-K-V-D-A-G-T-R-T-I-D-V-E-L-T-T-L-S-S-T-H-H-VL3PA2AEGRFGQI1R-Y-2QREQIQ1KQEDMF1KSNA1PSILDSTFKGGA1KQGSFV1-M-V-L-L-D-EGA1SATM1*KADSAFQYFLA1KR2LI
cpb_STRG-cpb.320.2      sp|Q8VXX4|RFC3_ARATH    85.606  132     19      0       145     540     1       132     6.89e-83        250     96.21   3IV6TS7QE1VI2NK1RK5GQ23RK2FY1PA2DE6KRIA6TS4VL8HN4NT10VI21FY1
cpb_STRG-cpb.320.2      sp|Q852K3|RFC5_ORYSJ    78.030  132     29      0       145     540     1       132     4.20e-77        235     89.39   3IV11IT2QDDQ5RK3SA1GQ18IV3LIRK1IM2PASG5VM3IT2VI1AT1TS1TN1DEVI2TATM3TA4LM15IV11TA3KRGA2
cpb_STRG-cpb.320.2      sp|Q9CAQ8|RFC5_ARATH    34.091  132     63      4       151     489     41      167     4.19e-17        78.6    53.03   1IVDE4KQTS2KD1IAVA1QR1VIAIQDNTLIRDKR1VTSN1GNDKCL4FL3SP1ST2KTTSLT1ML1LVLA1QKILFY2S-A-D-K-V-1VYER1KM1WLKEVLDN1GS-DTD1TG3-V-R-QEQLITQTDLFSA2HQHSVFES1NGPK1-S-V-K-L-V-L-L-D-E-A-D-A-M-T-K2GQ1QADL1YR1VIQEEKIYIT1
cpb_STRG-cpb.320.2      sp|Q93ZX1|RFC4_ARATH    43.939  66      37      0       151     348     11      76      6.79e-15        72.0    63.64   1IVDE5TQLVDKKD1IAVHHQQEDE1AVQRNV1RTKNLTVLSQETGA4LM5SP1ST2KT1LTIAML1LILARH1IL3SEALDY1VSKR1
cpb_STRG-cpb.320.2      sp|Q6YZ54|RFC3_ORYSJ    29.605  152     72      4       151     588     40      162     3.33e-14        70.5    48.68   1IVDE4KQTS1DGKD1IAVA1QR1VIAVQDNTLIRDKR1VTSN1GNDRCL4FL3SP1ST2KTTSLT1ML1LVLA1QKILFY1PSS-A-D-K-V-KQVYEG1KM1W-K-V-D-A-G-T-R-T-I-D-V-E-L-T-T-L-S-S-T-H-H-VL3PA2AEGRFGQI1R-Y-2QREQIQ1KQEDMF1KSNA1PSILDSTFKGGA1KQGSFV1-M-V-L-L-D-EGA1SATM1*KADSAFQYFLA1KR2LI
helixer_De_v1_hap3_chrs_chr_1_004325.1  sp|Q8VXX4|RFC3_ARATH    85.606  132     19      0       215     610     1       132     3.46e-83        251     96.21   3IV6TS7QE1VI2NK1RK5GQ23RK2FY1PA2DE6KRIA6TS4VL8HN4NT10VI21FY1
helixer_De_v1_hap3_chrs_chr_1_004325.1  sp|Q852K3|RFC5_ORYSJ    78.030  132     29      0       215     610     1       132     1.91e-77        236     89.39   3IV11IT2QDDQ5RK3SA1GQ18IV3LIRK1IM2PASG5VM3IT2VI1AT1TS1TN1DEVI2TATM3TA4LM15IV11TA3KRGA2
helixer_De_v1_hap3_chrs_chr_1_004325.1  sp|Q9CAQ8|RFC5_ARATH    38.235  102     54      3       221     514     41      137     4.63e-17        78.6    59.80   1IVDE4KQTS2KD1IAVA1QR1VIAIQDNTLIRDKR1VTSN1GNDKCL4FL3SP1ST2KTTSLT1ML1LVLA1QKILFY2S-A-D-K-V-1VYER1KM1WLKEVLDN1GS-DTD1TG3-V-R-QEQLITQTDLFSA2HQHSVFES1NGPK1
helixer_De_v1_hap3_chrs_chr_1_004325.1  sp|Q93ZX1|RFC4_ARATH    43.939  66      37      0       221     418     11      76      6.52e-15        72.4    63.64   1IVDE5TQLVDKKD1IAVHHQQEDE1AVQRNV1RTKNLTVLSQETGA4LM5SP1ST2KT1LTIAML1LILARH1IL3SEALDY1VSKR1
helixer_De_v1_hap3_chrs_chr_1_004325.1  sp|Q6YZ54|RFC3_ORYSJ    38.554  83      45      2       221     466     40      117     4.07e-14        70.1    62.65   1IVDE4KQTS1DGKD1IAVA1QR1VIAVQDNTLIRDKR1VTSN1GNDRCL4FL3SP1ST2KTTSLT1ML1LVLA1QKILFY1PSS-A-D-K-V-KQVYEG1KM1WLKEVLDN1GS-DTE1TG3

I tried using --force and --blast_targets $target_fasta, and changing --tsv to --xml, but the error stays the same. I am also confused about the --blast_targets flag: when is it needed?

The BLAST TSV was generated with:

makeblastdb \
-in $out_dir/blast/$prot_name.fa \
-dbtype prot -parse_seqids > \
$out_dir/blast/"$prot_name"_prepare.log

blastx -max_target_seqs 5 \
-outfmt "6 qseqid sseqid pident length mismatch gapopen qstart qend sstart send evalue bitscore ppos btop" \
-num_threads $threads \
-query $out_dir/mikado_prepared.fasta \
-db $out_dir/blast/$prot_name.fa \
-out $out_dir/mikado_prepared.blast.tsv

I could find only one similar issue that mentions regular expressions when importing BLAST data: https://github.com/EI-CoreBioinformatics/mikado/issues/392 Should I try rerunning makeblastdb without -parse_seqids? Another issue also mentioned that some combination of BLAST result columns has to be unique, but the error there was different. I did not filter the BLAST results in any way.

Any ideas what is causing the problem?

swarbred commented 3 months ago

As a starting point, I would suggest simplifying the headers in the FASTA file used to create your BLAST DB and redoing your BLAST, e.g. with seqkit -i

jolbi commented 2 months ago

I simplified the FASTA headers to include only the sequence ID (e.g. Q8RWX4) and checked for potential duplicates. I created a new BLAST DB and reran BLAST, but the error stays the same. For some reason, the target IDs in the TSV file include the string sp|<actual_id>|. Because of this, I also tried using IDs with a prefix (e.g. uniprot_Q8RWX4); the IDs in the TSV now match the FASTA headers, but the error still stays the same.

What else can I try? I'm running out of ideas. Should I rerun any of the previous Mikado stages after creating new BLAST results? I did not rerun any; I only updated the path to the BLAST target FASTA in the configuration YAML (for --json-conf).
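For anyone wanting to repeat the ID check described here, the following is a minimal sketch of how duplicate FASTA IDs and TSV target IDs missing from the FASTA could be detected. The helper names (fasta_ids, blast_target_ids) and the in-memory sample lines are hypothetical and only for illustration; they are not part of Mikado. The sketch assumes the sseqid is the second tab-separated column, as in the -outfmt string above.

```python
from collections import Counter


def fasta_ids(lines):
    """Return the first whitespace-delimited token of every FASTA header line."""
    return [line[1:].split()[0] for line in lines
            if line.startswith(">") and line[1:].strip()]


def blast_target_ids(lines):
    """Return the set of sseqid values (second column) from outfmt-6 rows."""
    return {line.split("\t")[1] for line in lines if line.strip()}


# Illustrative data only: one TSV target keeps the "sp|...|" wrapper,
# the other already matches the simplified FASTA header.
fasta = [">Q8RWX4", "MSEQ", ">Q1EHT7", "MSEQ"]
tsv = [
    "bacillus_STRG-bacillus.1.1\tsp|Q8RWX4|KOC1_ARATH\t35.294",
    "bacillus_STRG-bacillus.1.1\tQ1EHT7\t36.000",
]

ids = fasta_ids(fasta)
# IDs that occur more than once in the FASTA headers.
duplicates = sorted(i for i, n in Counter(ids).items() if n > 1)
# Target IDs in the TSV that have no matching FASTA header.
missing = sorted(blast_target_ids(tsv) - set(ids))

print("duplicate FASTA IDs:", duplicates)
print("TSV targets missing from FASTA:", missing)
```

With real files, the two lists of lines would be read from the target FASTA and the BLAST TSV instead of being defined inline.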

Edit: I am using Mikado v2.3.4

jolbi commented 2 months ago

The error Cannot use a compiled regex as replacement pattern with regex=False is a pandas error raised by the str.replace function, which is used here: https://github.com/EI-CoreBioinformatics/mikado/blob/69abafef128d2441e20126159e6d480ff9c2e956/Mikado/serializers/blast_serializer/tabular_utils.py#L266-L267

Since pandas 2.0, the regex argument of str.replace defaults to False. I added regex=True to both lines and mikado serialise now works.

My Mikado installation (installed with mamba) uses pandas v2.2.2.
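A minimal sketch that reproduces the behaviour outside Mikado, in case it helps whoever patches this. The pattern and the sample IDs are hypothetical and are not copied from tabular_utils.py; regex=False is passed explicitly so the error shows up regardless of the installed pandas version.

```python
import re

import pandas as pd

# Hypothetical compiled pattern and data, for illustration only.
pattern = re.compile(r"^sp\|([^|]+)\|.*$")
targets = pd.Series(["sp|Q8RWX4|KOC1_ARATH", "sp|Q1EHT7|C3H4_ORYSJ"])

# A compiled pattern combined with regex=False (the default since pandas 2.0)
# is rejected by Series.str.replace with a ValueError.
try:
    targets.str.replace(pattern, r"\1", regex=False)
    error_message = None
except ValueError as exc:
    error_message = str(exc)

print(error_message)

# Passing regex=True explicitly makes the same call work.
fixed = targets.str.replace(pattern, r"\1", regex=True)
print(fixed.tolist())
```

Setting regex=True explicitly should behave the same on pandas 1.x and 2.x, so the change would not need a version check.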

I'm not familiar with forking and issuing commits, so I'll leave this to you :)