anhaidgroup / py_entitymatching

BSD 3-Clause "New" or "Revised" License
183 stars 48 forks

About some bugs and how to launch the project #156

Open PoisonousBromineChan opened 1 year ago

PoisonousBromineChan commented 1 year ago

Dear team, I'm sorry to bother you, but I have some questions about how to launch the project. First, the project contains numerous files and a lot of code, and as a beginner I find it difficult to tell which code to run. (I know the goal is to merge ACM and DBLP, but how do I do that?) Second, I ran into an error after running the command 'sudo python3 -u setup.py install --single-version-externally-managed --record=record.txt'. The error said my system lacked the module 'numpy', but numpy is in fact installed. Could you give me some advice on how to debug and launch the program?

Wish you a happy new year! 
anhaidgroup commented 1 year ago

Hi. I'm sorry for the late reply. The whole team has been on winter break between the two semesters, and we just got back. We notice that you have opened three issues, and we will do our best to help with them.

First, may we ask which OS and Python version you are using? We will then reply and go from there. Thank you and regards, AnHai

PoisonousBromineChan commented 1 year ago

Well, I'm so glad you replied to me! I'm using Python 3.9.12 on my PC with Windows 11 and Python 3.8 on my Ubuntu 20.04 Linux virtual machine. I should also mention that I can now run all the code in the Basic EM Workflow Restaurants notebooks (.ipynb), but the DBLP-ACM notebook keeps reporting errors.

I have tried to solve Issue #157 myself, and I can share my findings with you. I found that values in "feature_vectors_dev" had more than 6 decimal places, which I believe exceeded the range of float64. I modified your code by adding feature_vectors_dev=feature_vectors_dev.round(6), but then it reported KeyError: 'DataFrame information is not present in the catalog'. Without that line, however, it reports that there are NaN, Infinity, or values too large to be processed as float64.

Can you give me some advice? Thank you.
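
A quick way to see which columns trigger the NaN/Infinity error described above is to count non-finite values per numeric column. The following is a minimal sketch, assuming feature_vectors_dev is a pandas DataFrame; the helper name summarize_non_finite is illustrative and not part of py_entitymatching. It only reads the table and does not modify it.

```python
import numpy as np
import pandas as pd


def summarize_non_finite(df):
    """Count NaN and +/-Infinity values in each numeric column of df.

    Hypothetical diagnostic helper; it returns a new summary table and
    leaves the input DataFrame untouched.
    """
    # Only numeric columns can hold NaN/inf in the sense of the error message.
    numeric = df.select_dtypes(include=[np.number])
    nan_counts = numeric.isna().sum()
    inf_counts = numeric.isin([np.inf, -np.inf]).sum()
    return pd.DataFrame({"nan": nan_counts, "inf": inf_counts})


# Example with a small stand-in table (not the real DBLP-ACM feature vectors).
example = pd.DataFrame({"sim_a": [1.0, np.nan, np.inf], "sim_b": [0.5, 0.25, 0.125]})
print(summarize_non_finite(example))
```

Columns with a non-zero "nan" count are the ones that imputation would need to fill before the table is passed to a matcher.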

anhaidgroup commented 1 year ago

Hi. Thanks for the info. We have verified that py_entitymatching can be installed in the machine and software settings that you specified. We have also been able to replicate the bug you hit in the DBLP-ACM notebook. We are looking into this bug and will get back to you asap. In the meantime, if we have any questions, we will let you know. Thank you for your patience on this. AnHai

Anson-Doan commented 1 year ago

Hello, I am one of the py_entitymatching developers. I have identified the cause of issue #157. You can fix it by adding the argument "missing_val=numpy.nan" (no quotes) to every em.impute_table() call in the program. impute_table() is called in cells 36, 45, and 51 of the DBLP-ACM notebook. It may appear in other Jupyter notebooks as well; you will have to add this argument wherever impute_table() appears, or you will get the same error. Note that you will have to import numpy before you can use numpy.nan, as we do not import it in the DBLP-ACM notebook.

Do not use feature_vectors_dev=feature_vectors_dev.round(); it will mess up the metadata and cause errors during catalog verification. In general, dataframes that are output by py_entitymatching functions should only be altered by other py_entitymatching functions; external function calls will not update the catalog properly and can cause errors.
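
The fix above can be sketched as a small wrapper. This is a minimal illustration, not the notebook's actual code: the helper name impute_with_nan_marker is hypothetical, the em argument stands for the imported py_entitymatching module, and the exclude_attrs list must be whatever the notebook cell already passes. Only the missing_val=numpy.nan argument comes from the developer's advice.

```python
import numpy as np  # the DBLP-ACM notebook does not import numpy itself


def impute_with_nan_marker(em, table, exclude_attrs, strategy="mean"):
    """Call em.impute_table() with missing_val=np.nan.

    Hypothetical convenience wrapper: `em` is the py_entitymatching module
    (import py_entitymatching as em), `table` is a feature-vector table
    produced by py_entitymatching, and `exclude_attrs` lists the columns
    that should not be imputed.
    """
    return em.impute_table(
        table,
        exclude_attrs=exclude_attrs,
        missing_val=np.nan,  # the argument recommended for issue #157
        strategy=strategy,
    )


# Intended use inside the notebook (column names here are assumptions):
# import py_entitymatching as em
# feature_vectors_dev = impute_with_nan_marker(
#     em, feature_vectors_dev, exclude_attrs=["_id", "ltable_id", "rtable_id", "label"]
# )
```

Because the table is passed straight to em.impute_table() and the result comes back from a py_entitymatching function, no external pandas call touches the dataframe, which is consistent with the warning about keeping the catalog metadata intact.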

PoisonousBromineChan commented 1 year ago

OK, thank you for your advice! I have succeeded in running the DBLP notebook. Best wishes, Chan.
