BioKIC / NEON-Biorepository

Development base for the NEON Biorepository Data Portal, hosted by BioKIC - Arizona State University (https://biorepo.neonscience.org)
GNU General Public License v2.0

Locality search bug, no results when searching for siteID #341

Closed (kyule closed 10 months ago)

kyule commented 1 year ago

No samples show up when either the siteID "TOOK" is typed into the locality search or the TOOK - Toolik Lake site is selected as a checkbox under D18. I have not checked many other site codes, but I haven't noticed the issue with other sites. If I go to the URL https://biorepo.neonscience.org/portal/collections/list.php?datasetid=128, I get the correct results, because dataset 128 should contain the TOOK samples. However, https://biorepo.neonscience.org/portal/collections/list.php?local=TOOK brings up zero results.

kyule commented 10 months ago

@sunray1 I also confirmed that this is only happening for TOOK. This also needs to be fixed ahead of the TOS Palooza in mid-January.

sunray1 commented 10 months ago

Likely because we're using an indexed fulltext search to search locality strings (faster, but misses some results due to search rules) and the word "took" is a stop word for MyISAM. See: https://dev.mysql.com/doc/refman/8.0/en/fulltext-stopwords.html
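As a rough illustration of the stopword behavior described above, the sketch below mimics how a fulltext parser discards query terms before any matching happens. It is a simplified model, not MySQL's actual parser: the stopword set is only a small excerpt of the MyISAM list, the minimum word length of 4 assumes the server's default `ft_min_word_len`, and `searchable_terms` and the site code HARV are illustrative names that do not come from the portal code.

```python
import re

# Small excerpt of the MyISAM stopword list (assumption: the server uses the
# built-in defaults; the full list is in the MySQL manual linked above).
STOPWORDS_EXCERPT = {"and", "of", "the", "too", "took"}

# Default MyISAM minimum indexed word length (ft_min_word_len).
FT_MIN_WORD_LEN = 4


def searchable_terms(query):
    """Return the query words a fulltext search would actually look up."""
    words = re.findall(r"[a-z0-9_']+", query.lower())
    return [
        w for w in words
        if len(w) >= FT_MIN_WORD_LEN and w not in STOPWORDS_EXCERPT
    ]


# "TOOK" is dropped as a stopword, so nothing is left to search for.
print(searchable_terms("TOOK"))
# A site code that is not a stopword survives and can be matched.
print(searchable_terms("HARV"))
```

Under these assumptions the first call returns an empty list, which is consistent with the zero results reported for `local=TOOK`.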

@egbot - "took" will have to be removed from the stopword list and the index rebuilt. I'm unsure how the index was built originally, and I likely do not have the permissions to do this.

sunray1 commented 10 months ago

Related @kyule - Could be fine, but FYI:

Occurrences like https://biorepo.neonscience.org/portal/collections/editor/occurrenceeditor.php?occid=236992 will not show up in https://biorepo.neonscience.org/portal/collections/list.php?local=TOOL, since fulltext searches only allow you to search by whole words.
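The whole-word limitation can be sketched in a few lines. This is a simplified model of word-based matching rather than the portal's real query, and both locality strings are hypothetical examples, not values taken from the record linked above.

```python
import re


def whole_word_match(term, locality):
    """True only if `term` equals one complete word of `locality`."""
    tokens = re.findall(r"[a-z0-9_]+", locality.lower())
    return term.lower() in tokens


# Hypothetical locality strings, with and without the siteID as its own word.
without_site_id = "Toolik Field Station, Alaska"
with_site_id = "TOOL, Toolik Field Station, Alaska"

print(whole_word_match("TOOL", without_site_id))  # False: "toolik" is a different word
print(whole_word_match("TOOL", with_site_id))     # True: the siteID is its own word
```

In this model a record is found by siteID only when the siteID appears as a separate word in the locality string.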

kyule commented 10 months ago

Thanks for pointing that out; I definitely didn't remember that! It is an issue right now for older specimens. We improved the way we harvest locality information from NEON though, so anytime one of these records is reharvested it theoretically should reformat the locality field to make sure the siteID is in there. Just reharvested this one and TOOK is in there.

egbot commented 10 months ago

That's exactly it! As you state, locality uses a MyISAM fulltext index. MyISAM fulltext stopwords are defined in a file on the MySQL server's file system (InnoDB stores its stopwords in a table). Modifying this text file and restarting MySQL/MariaDB is one solution. As an alternative that only affects the NEON portal, I modified the code to skip the fulltext index when certain words are searched against the locality field. I already have a similar solution in place for the collector field. See pull request below.

https://github.com/BioKIC/NEON-Biorepository/pull/378
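For readers who want the general shape of that workaround, here is a minimal sketch in Python. It is not the code from the pull request: the function name `locality_clause`, the skip list, and the `locality` column reference are all assumptions made for illustration, and a real implementation would also need to escape `%` and `_` in the search term before using it in a `LIKE` pattern.

```python
# Hypothetical list of terms that the fulltext index would silently ignore.
FULLTEXT_SKIP_WORDS = {"took"}


def locality_clause(term):
    """Build a parameterized WHERE fragment for a locality search.

    Returns (sql_fragment, params). Terms on the skip list bypass the
    fulltext index and fall back to a slower substring match.
    """
    cleaned = term.strip()
    if cleaned.lower() in FULLTEXT_SKIP_WORDS:
        # Fallback: plain substring search, which ignores stopword rules.
        return "locality LIKE %s", [f"%{cleaned}%"]
    # Default: fast word-based lookup through the fulltext index.
    return "MATCH(locality) AGAINST (%s)", [cleaned]


print(locality_clause("TOOK"))
print(locality_clause("Toolik"))
```

The trade-off in this sketch is that only the listed terms pay the cost of a substring scan, while every other search keeps using the index.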

When 3.1 rolls out, we are probably going to switch over to an InnoDB fulltext lookup, which uses a smaller list of stopwords.

egbot commented 10 months ago

Issue resolved: https://github.com/BioKIC/NEON-Biorepository/commit/49cec09622b7e26cee337ace6e50d55e344a37cc