galaxyproject / galaxy

Data intensive science for everyone.
https://galaxyproject.org

improve tool panel search #2272

Closed martenson closed 1 year ago

martenson commented 8 years ago

Reported by @jennaj: given the number of tools on Main, the search results need to be better, mainly:

I am trying to address the first two (for Main) with: https://github.com/galaxyproject/usegalaxy-playbook/pull/19

mvdbeek commented 3 years ago

Paused at a breakpoint in https://github.com/dannon/galaxy/blob/c0d1a915a056b89b24f567664e7c02daf40deb2e/lib/galaxy/tools/search/__init__.py#L222

hexylena commented 3 years ago

Ahh ok, wondered if it was a secret api I was missing.

hexylena commented 3 years ago

so I booted up a copy of the app against EU because I always feel worried about reproducing locally with the v. different toolboxes. This looks odd to me:

(Pdb) galaxy_app.toolbox_search.parser.parse('*' + 'ucsc main' + '*')
Or([Wildcard('name', '*ucsc'), Wildcard('old_id', '*ucsc'), Wildcard('description', '*ucsc'), Wildcard('section', '*ucsc'), Wildcard('help', '*ucsc'), Wildcard('labels', '*ucsc'), Wildcard('stub', '*ucsc'), Prefix('name', 'main'), Prefix('old_id', 'main'), Prefix('description', 'main'), Prefix('section', 'main'), Prefix('help', 'main'), Prefix('labels', 'main'), Prefix('stub', 'main')])

why does only ucsc stay prefixed with *, while main loses its one?
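One plausible reading of that parse, assuming the query string is split on whitespace before wildcards are interpreted: the single string `*ucsc main*` becomes two tokens, `*ucsc` (leading star, so a wildcard) and `main*` (only a trailing star, which is equivalent to a prefix query on `main`). The sketch below is plain Python that mimics this classification. `classify_tokens` and `wrap_each_term` are hypothetical helpers for illustration only, not part of Whoosh or Galaxy.

```python
def classify_tokens(query):
    """Label each whitespace-separated token the way the parse output suggests."""
    labelled = []
    for token in query.split():
        body = token.rstrip("*")
        if token.endswith("*") and "*" not in body and "?" not in body:
            # Only trailing stars: behaves like a prefix query on the bare text.
            labelled.append(("Prefix", body))
        elif "*" in token or "?" in token:
            # A star or question mark anywhere else makes it a wildcard query.
            labelled.append(("Wildcard", token))
        else:
            labelled.append(("Term", token))
    return labelled


def wrap_each_term(query):
    """Put a star on both sides of every word instead of around the whole string."""
    return " ".join(f"*{word}*" for word in query.split())


print(classify_tokens("*" + "ucsc main" + "*"))
print(classify_tokens(wrap_each_term("ucsc main")))
```

Wrapping each word separately gives `*ucsc* *main*`, so both terms keep their leading star.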

(Pdb) for idx, hit in enumerate(galaxy_app.toolbox_search.searcher.search(galaxy_app.toolbox_search.parser.parse('*ucsc main*'), limit=400)): print((idx, hit, hit.score) if 'ucsc' in hit['id'] else None)
...
(296, <Hit {'id': 'ucsc_table_direct1'}>, 0.4618992716030244)
(297, <Hit {'id': 'ucsc_table_direct_archaea1'}>, 0.22288633588616671)
None

or without *

(Pdb) for idx, hit in enumerate(galaxy_app.toolbox_search.searcher.search(galaxy_app.toolbox_search.parser.parse('ucsc main'), limit=400)): print((idx, hit, hit.score) if 'ucsc' in hit['id'] else None)
...
None
(103, <Hit {'id': 'ucsc_table_direct1'}>, 0.4371187999893639)
None
None
None
(107, <Hit {'id': 'ucsc_table_direct_archaea1'}>, 0.1950255439003959)

trying out the individual fields of a search, seems like description is a negative in this case:

(Pdb) for hit in galaxy_app.toolbox_search.searcher.search(MultifieldParser(['name', 'old_id', 'description'], schema= galaxy_app.toolbox_search.schema, group= OrGroup.factory(0.9)).parse('*ucsc* *main*'), limit=40): print(hit, hit.score)
<Hit {'id': 'vcf_to_maf_customtrack1'}> 1.9662951360360124
<Hit {'id': 'toolshed.g2.bx.psu.edu/repos/devteam/ncbi_blast_plus/ncbi_rpstblastn_wrapper/2.10.1+galaxy0'}> 0.8341075747686874
<Hit {'id': 'toolshed.g2.bx.psu.edu/repos/devteam/ncbi_blast_plus/ncbi_rpsblast_wrapper/2.10.1+galaxy0'}> 0.8341075747686874
<Hit {'id': 'toolshed.g2.bx.psu.edu/repos/devteam/emboss_5/EMBOSS: shuffleseq87/5.0.0.1'}> 0.8341075747686874
<Hit {'id': 'toolshed.g2.bx.psu.edu/repos/devteam/emboss_5/EMBOSS: notseq61/5.0.0'}> 0.8341075747686874
<Hit {'id': 'toolshed.g2.bx.psu.edu/repos/peterjc/tmhmm_and_signalp/tmhmm2/0.0.16'}> 0.8341075747686874
<Hit {'id': 'toolshed.g2.bx.psu.edu/repos/devteam/ucsc_custom_track/build_ucsc_custom_track_1/1.0.0'}> 0.4456963370285034
<Hit {'id': 'ucsc_table_direct1'}> 0.44038679685244836
<Hit {'id': 'ucsc_table_direct_archaea1'}> 0.20059770229755006
<Hit {'id': 'toolshed.g2.bx.psu.edu/repos/ebi-gxa/seurat_export_cellbrowser/seurat_export_cellbrowser/3.1.1+galaxy0'}> 0.1750118444883364
<Hit {'id': 'toolshed.g2.bx.psu.edu/repos/iuc/bedtools/bedtools_links/2.29.2'}> 0.13424605571325632
<Hit {'id': 'toolshed.g2.bx.psu.edu/repos/ebi-gxa/ucsc_cell_browser/ucsc_cell_browser/0.7.10+galaxy0'}> 0.11250224762822575
<Hit {'id': 'bwtool-lift'}> 0.05599209141540291

| tool | name | description |
| --- | --- | --- |
| vcf_to_maf_customtrack1 | VCF to MAF Custom Track | for display at UCSC |
| ucsc_table_direct1 | UCSC Main | table browser |

feels very odd that vcf scores higher.

hexylena commented 3 years ago

Some more debugging

(Pdb) print(galaxy_app.toolbox_search.searcher.search(MultifieldParser(['name', 'old_id', 'description'], schema= galaxy_app.toolbox_search.schema, group= OrGroup.factory(0.9)).parse('*ucsc main*'), limit=40, terms=True).termdocs)
{('name', b'ucsc'): array('I', [101, 1460, 2546]), ('description', b'ucsc'): array('I', [967, 1215, 1559, 2255, 2427]), ('description', b'maintaining'): array('I', [2122]), ('name', b'main'): array('I', [2546])}

So that's matching maintaining (hmm. I get why, but surely that should score lower than an exact word boundary match?)

and doc 2546 which hits both main + ucsc is indeed our tool:

(Pdb) print(list(galaxy_app.toolbox_search.searcher.search(MultifieldParser(['name', 'old_id', 'description'], schema= galaxy_app.toolbox_search.schema, group= OrGroup.factory(0.9)).parse('*ucsc main*'), limit=40, terms=True))[1].docnum)
2546

aha (ish)

(Pdb) for hit in galaxy_app.toolbox_search.searcher.search(MultifieldParser(['name', 'old_id', 'description'], schema= galaxy_app.toolbox_search.schema, group= OrGroup.factory(0.9)).parse('*ucsc main*'), limit=40, terms=True): print(hit, hit.score, hit.docnum, hit.matched_terms())
<Hit {'id': 'vcf_to_maf_customtrack1'}> 1.7205082440315107 967 [('description', b'ucsc')]
<Hit {'id': 'ucsc_table_direct1'}> 0.44322627964299954 2546 [('name', b'main'), ('name', b'ucsc')]
<Hit {'id': 'toolshed.g2.bx.psu.edu/repos/devteam/ucsc_custom_track/build_ucsc_custom_track_1/1.0.0'}> 0.38998429489994046 1215 [('description', b'ucsc')]
<Hit {'id': 'ucsc_table_direct_archaea1'}> 0.1755229895103563 101 [('name', b'ucsc')]
<Hit {'id': 'toolshed.g2.bx.psu.edu/repos/devteam/emboss_5/EMBOSS: shuffleseq87/5.0.0.1'}> 0.1736500008575602 2122 [('description', b'maintaining')]
<Hit {'id': 'toolshed.g2.bx.psu.edu/repos/ebi-gxa/seurat_export_cellbrowser/seurat_export_cellbrowser/3.1.1+galaxy0'}> 0.15313536392729438 1559 [('description', b'ucsc')]
<Hit {'id': 'toolshed.g2.bx.psu.edu/repos/iuc/bedtools/bedtools_links/2.29.2'}> 0.11746529874909928 2427 [('description', b'ucsc')]
<Hit {'id': 'toolshed.g2.bx.psu.edu/repos/ebi-gxa/ucsc_cell_browser/ucsc_cell_browser/0.7.10+galaxy0'}> 0.09843946667469754 1460 [('name', b'ucsc')]
<Hit {'id': 'bwtool-lift'}> 0.048993079988477545 2255 [('description', b'ucsc')]

Changing the OrGroup factor from 0.1 to 0.9 doesn't produce a big difference. Oddly, I've specified old_id in the MultifieldParser, but there are no ID matches? I'd expect

<Hit {'id': 'ucsc_table_direct1'}> 0.44322627964299954 2546 [('name', b'main'), ('name', b'ucsc'), ('old_id', b'ucsc')]

but old_id isn't anywhere there? It's when help is included that the results become garbage:

<Hit {'id': 'toolshed.g2.bx.psu.edu/repos/iuc/xpath/xpath/1.0.0'}> 5.7824381765403645 1006 [('help', b'maintainers')]
<Hit {'id': 'toolshed.g2.bx.psu.edu/repos/jjohnson/rsem/rsem_prepare_reference/1.1.17'}> 5.74646047844554 676 [('help', b'ucsc')]
<Hit {'id': 'toolshed.g2.bx.psu.edu/repos/iuc/mothur_get_communitytype/mothur_get_communitytype/1.39.5.0'}> 5.6337872716676785 1183 [('help', b'main_page')]
<Hit {'id': 'toolshed.g2.bx.psu.edu/repos/iuc/bedtools/bedtools_links/2.29.2'}> 5.628083628215019 2427 [('description', b'ucsc'), ('help', b'ucsc')]
<Hit {'id': 'toolshed.g2.bx.psu.edu/repos/iuc/mothur_lefse/mothur_lefse/1.39.5.0'}> 5.6050434590571285 1771 [('help', b'main_page')]
<Hit {'id': 'toolshed.g2.bx.psu.edu/repos/iuc/samtools_merge/samtools_merge/1.9'}> 5.4913612182626235 947 [('help', b'maintains')]
<Hit {'id': 'toolshed.g2.bx.psu.edu/repos/iuc/mothur_classify_rf/mothur_classify_rf/1.36.1.0'}> 5.4325805833938325 2391 [('help', b'main_page')]
<Hit {'id': 'toolshed.g2.bx.psu.edu/repos/iuc/mothur_pcr_seqs/mothur_pcr_seqs/1.39.5.0'}> 5.3072620955901515 121 [('help', b'main_page')]
<Hit {'id': 'toolshed.g2.bx.psu.edu/repos/iuc/mothur_get_mimarkspackage/mothur_get_mimarkspackage/1.39.5.0'}> 5.187549416742254 1622 [('help', b'main_page')]
<Hit {'id': 'toolshed.g2.bx.psu.edu/repos/iuc/mothur_merge_files/mothur_merge_files/1.39.5.0'}> 5.187549416742254 1767 [('help', b'main_page')]
<Hit {'id': 'toolshed.g2.bx.psu.edu/repos/iuc/mothur_primer_design/mothur_primer_design/1.39.5.0'}> 5.18227372171205 286 [('help', b'main_page')]
<Hit {'id': 'toolshed.g2.bx.psu.edu/repos/bgruening/openbabel/ctb_subsearch/0.1'}> 5.105499296205204 1339 [('help', b'main_page')]
<Hit {'id': 'toolshed.g2.bx.psu.edu/repos/iuc/mothur_fastq_info/mothur_fastq_info/1.39.5.0'}> 5.105499296205204 1414 [('help', b'main_page')]
<Hit {'id': 'toolshed.g2.bx.psu.edu/repos/iuc/mothur_make_lookup/mothur_make_lookup/1.39.5.0'}> 5.0794508304082395 1662 [('help', b'main_page')]
<Hit {'id': 'toolshed.g2.bx.psu.edu/repos/iuc/mothur_cluster_classic/mothur_cluster_classic/1.39.5.0'}> 5.0794508304082395 1710 [('help', b'main_page')]
<Hit {'id': 'toolshed.g2.bx.psu.edu/repos/iuc/mothur_make_fastq/mothur_make_fastq/1.39.5.0'}> 5.0794508304082395 1723 [('help', b'main_page')]
<Hit {'id': 'toolshed.g2.bx.psu.edu/repos/iuc/mothur_chimera_vsearch/mothur_chimera_vsearch/1.39.5.1'}> 5.039159295530664 161 [('help', b'main_page')]

so they're all matching on the term main, even though EU's balances should preclude these getting ANY points:

(Pdb) galaxy_app.toolbox_search.searcher.weighting.weightings['help']._field_B
{'help': 1.0}
(Pdb) galaxy_app.toolbox_search.searcher.weighting.weightings['name']._field_B
{'name': 40.0}
(Pdb) galaxy_app.toolbox_search.searcher.weighting.weightings['description']._field_B
{'description': 40.0}
(Pdb) galaxy_app.toolbox_search.searcher.weighting.weightings['name']._field_B
{'name': 40.0}

So constructing my own weightings

(Pdb) for hit in galaxy_app.toolbox_search.index.searcher(weighting=MultiWeighting(BM25F(), name=BM25F(name_B=float(1.0)), help=BM25F(name_B=float(1.0)))).search(MultifieldParser(['name', 'old_id', 'description', 'section', 'help'], schema= galaxy_app.toolbox_search.schema, group= OrGroup.factory(0.99)).parse('*ucsc main*'), limit=40, terms=True): print(hit, hit.score, hit.docnum, hit.matched_terms())
<Hit {'id': 'toolshed.g2.bx.psu.edu/repos/devteam/ucsc_custom_track/build_ucsc_custom_track_1/1.0.0'}> 20.208461736081546 1215 [('description', b'ucsc'), ('help', b'ucsc'), ('help', b'main')]
...
<Hit {'id': 'vcf_to_maf_customtrack1'}> 9.356772484789413 967 [('description', b'ucsc'), ('help', b'ucsc')]
...
<Hit {'id': 'ucsc_table_direct1'}> 8.288424974617481 2546 [('name', b'main'), ('name', b'ucsc')]

vs

(Pdb) for hit in galaxy_app.toolbox_search.index.searcher(weighting=MultiWeighting(BM25F(), name=BM25F(name_B=float(2.0)), help=BM25F(name_B=float(1.0)))).search(MultifieldParser(['name', 'old_id', 'description', 'section', 'help'], schema= galaxy_app.toolbox_search.schema, group= OrGroup.factory(0.99)).parse('*ucsc main*'), limit=40, terms=True): print(hit, hit.score, hit.docnum, hit.matched_terms())
<Hit {'id': 'toolshed.g2.bx.psu.edu/repos/devteam/ucsc_custom_track/build_ucsc_custom_track_1/1.0.0'}> 20.208461736081546 1215 [('description', b'ucsc'), ('help', b'ucsc'), ('help', b'main')]
...
<Hit {'id': 'vcf_to_maf_customtrack1'}> 9.356772484789413 967 [('description', b'ucsc'), ('help', b'ucsc')]
....
<Hit {'id': 'ucsc_table_direct1'}> 5.63599403557272 2546 [('name', b'main'), ('name', b'ucsc')]

so name boost of 2 is worse than a name boost of 1? ucsc_table_direct1 goes from 8 to 5? Swapping the weights for name=1, help=2

(Pdb) for hit in galaxy_app.toolbox_search.index.searcher(weighting=MultiWeighting(BM25F(), name=BM25F(name_B=float(1.0)), help=BM25F(name_B=float(2.0)))).search(MultifieldParser(['name', 'old_id', 'description', 'section', 'help'], schema= galaxy_app.toolbox_search.schema, group= OrGroup.factory(0.99)).parse('*ucsc main*'), limit=40, terms=True): print(hit, hit.score, hit.docnum, hit.matched_terms())
<Hit {'id': 'toolshed.g2.bx.psu.edu/repos/devteam/ucsc_custom_track/build_ucsc_custom_track_1/1.0.0'}> 20.208461736081546 1215 [('description', b'ucsc'), ('help', b'ucsc'), ('help', b'main')]
<Hit {'id': 'wig_to_bigWig'}> 12.279176289129241 1072 [('help', b'_ucsc'), ('help', b'ucsc')]
<Hit {'id': 'toolshed.g2.bx.psu.edu/repos/ebi-gxa/seurat_export_cellbrowser/seurat_export_cellbrowser/3.1.1+galaxy0'}> 11.821254960210167 1559 [('description', b'ucsc'), ('help', b'ucsc'), ('help', b'maintained')]
<Hit {'id': 'toolshed.g2.bx.psu.edu/repos/ebi-gxa/ucsc_cell_browser/ucsc_cell_browser/0.7.10+galaxy0'}> 10.435593913196673 1460 [('name', b'ucsc'), ('help', b'ucsc')]
<Hit {'id': 'toolshed.g2.bx.psu.edu/repos/iuc/ebi_metagenomics_run_downloader/ebi_metagenomics_run_downloader/0.1.0'}> 10.190349240221241 2105 [('help', b'maintains')]
<Hit {'id': 'toolshed.g2.bx.psu.edu/repos/iuc/bedtools/bedtools_flankbed/2.29.2'}> 9.717608110624447 1931 [('help', b'ucsc'), ('help', b'main')]
<Hit {'id': 'vcf_to_maf_customtrack1'}> 9.356772484789413 967 [('description', b'ucsc'), ('help', b'ucsc')]
<Hit {'id': 'toolshed.g2.bx.psu.edu/repos/iuc/bedtools/bedtools_links/2.29.2'}> 9.255265497863771 2427 [('description', b'ucsc'), ('help', b'ucsc')]
<Hit {'id': 'toolshed.g2.bx.psu.edu/repos/bgruening/replace_column_by_key_value_file/replace_column_with_key_value_file/0.1'}> 8.858243038340541 2046 [('help', b'ucsc')]
<Hit {'id': 'ucsc_table_direct1'}> 8.288424974617481 2546 [('name', b'main'), ('name', b'ucsc')]

like, are boosts inverse? Fixing description to 40, name=1 returns ucsc_table_direct1 with the same score but vcf_to_maf_customtrack1 is finally gone?

(Pdb) for hit in galaxy_app.toolbox_search.index.searcher(weighting=MultiWeighting(BM25F(), name=BM25F(name_B=float(1.0)), description=BM25F(description_B=float(40.0)))).search(MultifieldParser(['name', 'old_id', 'description', 'section', 'help'], schema= galaxy_app.toolbox_search.schema, group= OrGroup.factory(0.99)).parse('*ucsc main*'), limit=40, terms=True): print(hit, hit.score, hit.docnum, hit.matched_terms())
<Hit {'id': 'toolshed.g2.bx.psu.edu/repos/devteam/ucsc_custom_track/build_ucsc_custom_track_1/1.0.0'}> 14.699808940243486 1215 [('description', b'ucsc'), ('help', b'ucsc'), ('help', b'main')]
<Hit {'id': 'wig_to_bigWig'}> 12.279176289129241 1072 [('help', b'_ucsc'), ('help', b'ucsc')]
<Hit {'id': 'toolshed.g2.bx.psu.edu/repos/ebi-gxa/ucsc_cell_browser/ucsc_cell_browser/0.7.10+galaxy0'}> 10.435593913196673 1460 [('name', b'ucsc'), ('help', b'ucsc')]
<Hit {'id': 'toolshed.g2.bx.psu.edu/repos/iuc/ebi_metagenomics_run_downloader/ebi_metagenomics_run_downloader/0.1.0'}> 10.190349240221241 2105 [('help', b'maintains')]
<Hit {'id': 'toolshed.g2.bx.psu.edu/repos/iuc/bedtools/bedtools_flankbed/2.29.2'}> 9.717608110624447 1931 [('help', b'ucsc'), ('help', b'main')]
<Hit {'id': 'toolshed.g2.bx.psu.edu/repos/bgruening/replace_column_by_key_value_file/replace_column_with_key_value_file/0.1'}> 8.858243038340541 2046 [('help', b'ucsc')]
<Hit {'id': 'ucsc_table_direct1'}> 8.288424974617481 2546 [('name', b'main'), ('name', b'ucsc')]

Got ucsc main above for the first time:

(Pdb) for hit in galaxy_app.toolbox_search.index.searcher(weighting=MultiWeighting(BM25F(), name=BM25F(name_B=float(0.1)), description=BM25F(description_B=float(0.1)))).search(MultifieldParser(['name', 'old_id', 'description', 'section'], schema= galaxy_app.toolbox_search.schema, group= OrGroup.factory(0.99)).parse('*ucsc main*'), limit=40, terms=True): print(hit, hit.score, hit.docnum, hit.matched_terms())
<Hit {'id': 'ucsc_table_direct1'}> 12.409585144761694 2546 [('name', b'main'), ('name', b'ucsc')]
<Hit {'id': 'toolshed.g2.bx.psu.edu/repos/devteam/emboss_5/EMBOSS: shuffleseq87/5.0.0.1'}> 6.3525702972838705 2122 [('description', b'maintaining')]
<Hit {'id': 'vcf_to_maf_customtrack1'}> 6.111259177622381 967 [('description', b'ucsc')]

With... both terms boosted to 0.1. This seems like black magic?
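A possible explanation for the inverse-looking behaviour, assuming (as I understand Whoosh's BM25F) that keyword arguments of the form `<field>_B` set BM25's free parameter B for that field rather than a boost multiplier. B controls length normalisation and is normally kept between 0 and 1. The sketch below is the textbook BM25 term score in plain Python; it is not Whoosh's code, and the idf, tf, and field lengths are made-up illustrative values.

```python
def bm25(idf, tf, field_len, avg_field_len, B=0.75, K1=1.2):
    """Textbook BM25 score for one term in one field.

    B scales how strongly the field length (relative to the average)
    affects the score; it is not a multiplicative boost.
    """
    norm = (1 - B) + B * field_len / avg_field_len
    return idf * (tf * (K1 + 1)) / (tf + K1 * norm)


# Illustrative values only: one matching term, field twice the average length.
longer = {B: bm25(idf=2.0, tf=1, field_len=12, avg_field_len=6, B=B)
          for B in (0.1, 1.0, 2.0)}
# Same term in a field half the average length.
shorter = {B: bm25(idf=2.0, tf=1, field_len=3, avg_field_len=6, B=B)
           for B in (0.1, 1.0, 2.0)}

for B in (0.1, 1.0, 2.0):
    print(B, round(longer[B], 3), round(shorter[B], 3))
```

For the longer-than-average field the score falls as B rises, which would be consistent with a larger name_B scoring lower if the matched field is longer than average. For the shorter field the score rises instead, and with a very large value such as 40 the denominator can even turn negative. If this reading is right, the values passed here never act as boosts at all.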

martenson commented 3 years ago

Boosts shouldn't be inverse: https://whoosh.readthedocs.io/en/latest/schema.html?highlight=boost#field-boosts (I am sorry I do not have time atm to dive into this)

hexylena commented 3 years ago

My thought too after reading the doc!! But it definitely seems to be behaving like it is? The only time I can get ucsc_table_direct1 to have a high score (10+) is when I do name=0.1, desc=0.1, rest=1

mvdbeek commented 3 years ago

I am circling around a bug in whoosh's MultiWeighting class, which alters the scores in a nonsensical way. Haven't finished this though.

hexylena commented 3 years ago

Compare the results for 'snpeff eff':

0.1 name/desc → 25

(Pdb) for hit in galaxy_app.toolbox_search.index.searcher(weighting=MultiWeighting(BM25F(), old_id=BM25F(old_id_B=1.0), name=BM25F(name_B=0.1), section=BM25F(section_B=1.0), description=BM25F(description_B=0.1), labels=BM25F(labels_B=1.0), stub=BM25F(stub_B=1.0), help=BM25F(help_B=1.0))).search(MultifieldParser(['name', 'old_id', 'description', 'section', 'help'], schema= galaxy_app.toolbox_search.schema, group= OrGroup.factory(0.90)).parse('*snpeff eff*'.lower()), limit=40, terms=True): print((hit, hit.score, hit.docnum, hit.matched_terms()))
(<Hit {'id': 'toolshed.g2.bx.psu.edu/repos/iuc/snpeff_sars_cov_2/snpeff_sars_cov_2/4.5covid19'}>, 35.753471381397226, 448, [('help', b'snpeff'), ('name', b'eff'), ('name', b'snpeff'), ('help', b'effect'), ('help', b'effects')])
(<Hit {'id': 'toolshed.g2.bx.psu.edu/repos/iuc/snpeff/snpEff/4.3+T.galaxy1'}>, 33.047009486777924, 832, [('help', b'snpeff'), ('name', b'eff'), ('name', b'snpeff'), ('help', b'effect'), ('help', b'effects'), ('help', b'eff')])
(<Hit {'id': 'toolshed.g2.bx.psu.edu/repos/jjohnson/snpeff_to_peptides/snpeff_to_peptides/0.0.1'}>, 25.2224812642123, 1511, [('help', b'_snpeff'), ('help', b'snpeff'), ('name', b'snpeff'), ('help', b'effects'), ('help', b'eff')])
(<Hit {'id': 'toolshed.g2.bx.psu.edu/repos/iuc/snpeff/snpEff_databases/4.3+T.galaxy2'}>, 25.031613016029844, 1223, [('help', b'snpeff'), ('name', b'snpeff'), ('help', b'eff')])

vs

10.0 name/desc → 19

(Pdb) for hit in galaxy_app.toolbox_search.index.searcher(weighting=MultiWeighting(BM25F(), old_id=BM25F(old_id_B=1.0), name=BM25F(name_B=10.0), section=BM25F(section_B=1.0), description=BM25F(description_B=10.0), labels=BM25F(labels_B=1.0), stub=BM25F(stub_B=1.0), help=BM25F(help_B=1.0))).search(MultifieldParser(['name', 'old_id', 'description', 'section', 'help'], schema= galaxy_app.toolbox_search.schema, group= OrGroup.factory(0.90)).parse('*snpeff eff*'.lower()), limit=40, terms=True): print((hit, hit.score, hit.docnum, hit.matched_terms()))
(<Hit {'id': 'toolshed.g2.bx.psu.edu/repos/iuc/snpeff_sars_cov_2/snpeff_sars_cov_2/4.5covid19'}>, 22.617891323807093, 448, [('help', b'snpeff'), ('name', b'eff'), ('name', b'snpeff'), ('help', b'effect'), ('help', b'effects')])
(<Hit {'id': 'toolshed.g2.bx.psu.edu/repos/iuc/snpeff/snpEff/4.3+T.galaxy1'}>, 19.91142942918779, 832, [('help', b'snpeff'), ('name', b'eff'), ('name', b'snpeff'), ('help', b'effect'), ('help', b'effects'), ('help', b'eff')])

edit: sorry, had an old help boost.

Or the query "select lines that match an expression"

0.1/0.1 → Grep1 = 40.0, 1st place

(Pdb) for hit in galaxy_app.toolbox_search.index.searcher(weighting=MultiWeighting(BM25F(), old_id=BM25F(old_id_B=1.0), name=BM25F(name_B=0.1), section=BM25F(section_B=1.0), description=BM25F(description_B=0.1), labels=BM25F(labels_B=1.0), stub=BM25F(stub_B=1.0), help=BM25F(help_B=1.0))).search(MultifieldParser(['name', 'old_id', 'description', 'section', 'help'], schema= galaxy_app.toolbox_search.schema, group= OrGroup.factory(0.90)).parse('*select lines that match an expression*'.lower()), limit=40, terms=True): print((hit, hit.score, hit.docnum, hit.matched_terms()))
(<Hit {'id': 'Grep1'}>, 40.085088543825, 621, [('help', b'match'), ('help', b'lines'), ('help', b'expression'), ('description', b'expression'), ('description', b'lines'), ('description', b'match'), ('name', b'select'), ('help', b'select')])

40/40 → Grep1 = 16, 2nd place

(Pdb) for hit in galaxy_app.toolbox_search.index.searcher(weighting=MultiWeighting(BM25F(), old_id=BM25F(old_id_B=1.0), name=BM25F(name_B=40.0), section=BM25F(section_B=1.0), description=BM25F(description_B=40.0), labels=BM25F(labels_B=1.0), stub=BM25F(stub_B=1.0), help=BM25F(help_B=1.0))).search(MultifieldParser(['name', 'old_id', 'description', 'section', 'help'], schema= galaxy_app.toolbox_search.schema, group= OrGroup.factory(0.90)).parse('*select lines that match an expression*'.lower()), limit=40, terms=True): print((hit, hit.score, hit.docnum, hit.matched_terms()))
(<Hit {'id': 'toolshed.g2.bx.psu.edu/repos/bgruening/text_processing/tp_grep_tool/1.1.1'}>, 18.069423636987338, 1181, [('help', b'match'), ('help', b'lines'), ('help', b'expressions'), ('help', b'expression'), ('help', b'select')])
(<Hit {'id': 'Grep1'}>, 16.10481479336669, 621, [('help', b'match'), ('help', b'lines'), ('help', b'expression'), ('description', b'expression'), ('description', b'lines'), ('description', b'match'), ('name', b'select'), ('help', b'select')])

hexylena commented 3 years ago

@mvdbeek did you have any more information about what that issue was with whoosh?

hexylena commented 3 years ago

So we deployed the new boosts on eu, to see how those work. I.... think they're a huge improvement? I was discussing with @shiltemann and her test query was 'group', expecting the full match of Grouping1 to be found. We need some way to rank by "this term or terms constitutes the entire name field", but I'm not sure how we'd accomplish that given that we currently break into individual words :/
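One way the "entire name field" idea could be approximated, sketched under loud assumptions: re-rank the hits after the search so that a tool whose whole name equals the query comes first. `rerank_exact_name_first`, the (tool id, name, score) tuple layout, and the example names and scores are all hypothetical. This is not how Galaxy's tool search is implemented.

```python
def rerank_exact_name_first(query, hits):
    """Sort hits so exact whole-name matches come first, then by descending score.

    hits is a list of (tool_id, name, score) tuples (a hypothetical layout).
    """
    wanted = " ".join(query.lower().split())

    def sort_key(hit):
        _tool_id, name, score = hit
        is_exact = " ".join(name.lower().split()) == wanted
        # False sorts before True, so exact matches (not is_exact == False) lead.
        return (not is_exact, -score)

    return sorted(hits, key=sort_key)


# Made-up names and scores, for illustration only.
example_hits = [
    ("hypothetical_summary_tool", "Group and summarise columns", 9.0),
    ("Grouping1", "Group", 4.0),
    ("hypothetical_other_tool", "Regroup", 6.5),
]

for tool_id, name, score in rerank_exact_name_first("group", example_hits):
    print(tool_id, name, score)
```

A post-search re-rank like this sidesteps the word-splitting problem because it compares the full name string, at the cost of only helping for hits that made it into the result list in the first place.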

hexylena commented 3 years ago

@bgruening provides 'tail-to-head', which doesn't return anything useful (though I don't know how it behaved before), and the same goes for 'tail'.

@wm75 provides:

The only exception I found so far is mimodd vcf, which only returns general vcf stuff as top hits. Strangely, reversing the words to vcf mimodd does much better.

simonbray commented 3 years ago

Not sure if this is the right place to report issues with the search, but trying to find Filter failed datasets from a collection with the search is quite tough. For tools which contain a relatively unique word things seem to be better than they used to be :+1:

jrr-cpt commented 3 years ago

Putting this here after some interactions at CoFest. Search functions have definitely improved with updates. There are still cases where the search results could be improved. I think that it is intentional for the results to include potentially less relevant hits, to assist with tool discovery and to help with spelling errors/choices. But I think if the weighting more obviously favored the tool name over the description, this would help a lot with generic search terms. Our users at the CPT Galaxy would prefer a stricter (smaller) search result, and we don't even have as many tools as the larger public Galaxies! Perhaps this is already implemented, but it isn't terribly transparent how the tool search works, and it is hard to pick out logical patterns in the order of the returned list by eye (tool name relevance, alphabetical, popularity/use?).

For example, in our CPT Galaxy, where we're running 20.05 (I realize that this does not have all the latest fixes discussed in this issue; @hexylena, ngram searching is enabled), when I searched fasta looking for a tool called Remove FASTA Sequences from .gff3 File, it was 19th in the list, and various tools with that string NOT in the name came before it.

[Screenshot: fasta 1]

At usegalaxy.org when I search align, I get this,

[Screenshot: align]

At usegalaxy.eu when I search for genome, the list includes the assembly tools like Bowtie2 and Spades pretty far down into the results.

[Screenshot: genome]

That is still the case when I search for genome assembly.

[Screenshot: genome assembly]

Maybe all these issues can be ameliorated with better tool metadata. New users, and users doing new analyses, will search for tools that they don't necessarily know the names of. Tool organization is pretty good, and while it is great to have many tool options, having very many tools also makes it hard to discover new ones without consistent help from the search function. Perhaps a ‘close match’ and ‘related match’ scenario would vastly improve the overall user experience and make it easier to discover just the right tools?

martenson commented 1 year ago

Most likely resolved by various PRs; please open a new issue for new requests or observations.