ekzhu / josie

Code and Benchmarks for JOSIE (SIGMOD 2019)

How to run JOSIE to find joinable tables (columns) given table csv? #3

Open v4ray opened 1 year ago

v4ray commented 1 year ago

Hi, this is great work! I am trying to experiment with JOSIE to find joinable tables and am unsure about the data pipeline. Could you briefly explain how to use the JOSIE codebase to find joinable tables for a given query column when the input data are several raw csv files (from another dataset), each representing a table?

This codebase seems to depend on postgres dump files representing tables. Is it necessary to generate these dump files for the above purpose, and if so, how do I do it?

Thank you!

ekzhu commented 1 year ago

This repo is mainly for reproducibility in academic settings. If you are interested in building a real application, you can take a look at:

  1. ekzhu/SetSimilaritySearch: All-pair set similarity search on millions of sets in Python and on a laptop (github.com) https://github.com/ekzhu/setsimilaritysearch
  2. MinHash LSH — datasketch 1.5.9 documentation (ekzhu.com) https://ekzhu.com/datasketch/lsh.html

Neither of the above implements JOSIE, but they should be good enough depending on your use case.

ekzhu commented 1 year ago

I recommend starting with MinHashLSH for finding joinable tables. First, create a MinHash for every column. Then index all the MinHashes in a MinHashLSH index. After that, you can query the index for columns that have high Jaccard similarity with your query column.

v4ray commented 1 year ago

thanks