ShansanChu / insight_DS

Data science project at Insight

set some user preferences, custom use case #3

Open ghost opened 4 years ago

ghost commented 4 years ago

Hi @ShansanChu ,

Hope you are all well !

I wanted to ask you a couple of questions, as I am developing a website called https://paper2code. It is a search engine for research papers and their related source code.

So, one feature I wanted to add is really similar to your repository, https://github.com/ShansanChu/insight_DS.

I'd like to help researchers browse arxiv/aclweb, etc. papers by using their abstracts' text and calculating their similarity. But I would also like to add a collaborative layer: if a researcher favorites a paper, that signal improves the ranking of the text-similarity results.

So my questions are:

Please find below some articles that mostly describe what I'd like to do for paper2code.

Refs:

Thanks in advance for any reply or input on these questions.

Cheers, X

ShansanChu commented 4 years ago

Thanks for your interest in my project. If you are using the abstract's text, I guess it's not very long, so you don't need to do summarization. One possible method is to use the sentence embeddings directly as the feature extraction. Sure, it's possible to do personalized ranking by adding more layers or by using ranking models with the sentence embeddings as input.
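
For concreteness, here is a minimal sketch of what I mean by using the abstract embeddings directly and ranking by similarity. It is not from insight_DS; it assumes the sentence-transformers package and the all-MiniLM-L6-v2 model, which are just illustrative choices:

```python
# Sketch only: embed abstracts and rank candidates by cosine similarity to a query.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")  # any sentence-embedding model works

abstracts = [
    "We propose a transformer-based approach to extractive summarization ...",
    "A graph neural network for citation recommendation ...",
    "Contrastive sentence embeddings for semantic search ...",
]
query = "Semantic search over research paper abstracts using sentence embeddings."

corpus_emb = model.encode(abstracts, convert_to_tensor=True)
query_emb = model.encode(query, convert_to_tensor=True)

scores = util.cos_sim(query_emb, corpus_emb)[0]   # one score per abstract
ranked = scores.argsort(descending=True)
for idx in ranked:
    print(f"{scores[idx]:.3f}  {abstracts[idx][:60]}")
```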

As for my project, I don't have labels for training. What I do is retrieve similar articles based on sentence similarity, but I don't have the ranking part, so I wouldn't suggest adding more layers on top of my project. For your case, you can use the sentence embeddings of the abstract as the feature extraction (i.e., the embedding is the item's characterization), and then apply ranking models to train your own ranking model, for example as sketched below.
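
A rough sketch of the ranking part on top of those embeddings, assuming LightGBM's LGBMRanker and a toy "favorited" label; this is one possible ranking model, not something implemented in insight_DS:

```python
# Sketch only: learn a personalized ranking from a "favorited" signal on top of
# embedding-based features (random stand-ins here).
import numpy as np
import lightgbm as lgb

# X: one row per (query, candidate paper) pair, built from the embeddings.
# y: relevance label, e.g. 1 if the user favorited the paper, 0 otherwise.
# group: number of candidate rows per query, in the same order as X.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 64))        # 10 queries x 20 candidates, stand-in features
y = rng.integers(0, 2, size=200)
group = [20] * 10

ranker = lgb.LGBMRanker(objective="lambdarank", n_estimators=100)
ranker.fit(X, y, group=group)

# At serving time, score the similarity candidates and re-rank by the model output.
scores = ranker.predict(X[:20])
reranked = np.argsort(-scores)
```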

Cheers, Shansan

ghost commented 4 years ago

Hi,

Thanks for the reply :-)

In fact, I have hybrid documents, like READMEs from the related code, that I want to summarize with bert-extractive-summarizer, as well as the usual abstracts. So I do not know how to add a flag to skip this summarization step during the training process.

Can you help me/us make a generic version of insight_DS? It would be really awesome.

I mean that I am much more a gopher than a pythonista, and I am a little bit lost about the order in which to execute the training scripts. The rest, I mean the server and the frontend stuff, I can manage with no worries.

If you have 5-10 spare minutes to do it, you would be an awesome guy. What I can offer in return is to dockerize the final result in a fork and make a PR.

Cheers, X

ghost commented 4 years ago

@ShansanChu any chance for the generic solution? Please :-)

ShansanChu commented 4 years ago

Sorry, I have a busy schedule these couple of weeks. I'll get back to you when I'm available.

ghost commented 4 years ago

Thanks mate

ShansanChu commented 4 years ago

Hi,

Sorry for getting back to you late. Returning to the training case: if you want to skip the summarization step with a flag, I would suggest you use summarization as a step in your data preprocessing rather than adding it to the training model. If you want to train the summarizer as part of the customized ranking, I'm not sure this works as an end-to-end model. For the summarization part, training the model requires calculating derivatives during backpropagation, so it may be complicated to train end to end, and we would also need a labeled dataset to train the summarization part.
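
For instance, a minimal sketch of the preprocessing idea, assuming bert-extractive-summarizer and a hypothetical per-document flag (this is an illustration, not part of insight_DS):

```python
# Sketch only: summarize long hybrid documents (e.g. READMEs) before embedding,
# and pass short abstracts through unchanged, controlled by a per-document flag.
from summarizer import Summarizer  # bert-extractive-summarizer

summarizer = Summarizer()

def preprocess(text: str, needs_summary: bool, ratio: float = 0.2) -> str:
    """Return the text to embed; summarize only when the flag says so."""
    if needs_summary:
        return summarizer(text, ratio=ratio)
    return text

docs = [
    {"text": "A long README describing installation, training, evaluation ...", "is_readme": True},
    {"text": "We study sentence embeddings for paper retrieval ...", "is_readme": False},
]
clean_texts = [preprocess(d["text"], needs_summary=d["is_readme"]) for d in docs]
# clean_texts can then go into the sentence-embedding / similarity pipeline.
```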

Best, Shan

ghost commented 4 years ago

Hi,

Thanks for the reply.

I have a dataset of 290k abstracts, and here is the link to download it: http://paper2code.com/public/suggest_dump.txt.tar.gz

Do you think we can give this one a try? That would be awesome.

Cheers, X