bnosac / ruimtehol

R package to Embed All the Things! using StarSpace
Mozilla Public License 2.0
99 stars 13 forks source link

Sentence separator for labelDoc format #22

Closed matthijsvanderloos closed 4 years ago

matthijsvanderloos commented 4 years ago

Hi Jan,

I've been using ruimtehol for a while now, specifically the articlespace embedding functionality. However, I recently noticed that the embed_articlespace function produces an empty term in the dictionary of the StarSpace model.

Reproducible example:

library(ruimtehol)
library(udpipe)

data(brussels_reviews_anno, package = "udpipe")
x <- subset(brussels_reviews_anno, language == "nl")
x$token <- x$lemma
x <- x[, c("doc_id", "sentence_id", "token")]
set.seed(123456789)
model <- embed_articlespace(x, early_stopping = 1,
                            dim = 25, epoch = 25, minCount = 2,
                            negSearchLimit = 1, maxNegSamples = 2)
dict <- starspace_dictionary(model)
# Empty string in dictionary?
dict$dictionary[1, ]

>>  term is_word is_label
>>1         TRUE    FALSE

When I change the sentence separator to \t (instead of <space>\t<space>), the empty term is not in the dictionary anymore.

Hence, should the sentence separator for the labelDoc format be surrounded with spaces or not?

jwijffels commented 4 years ago

Do you mean you changed the code in embed_articlespace which now has

x <- sapply(x, FUN=function(tokens){
    sentences <- split(tokens, tokens$sentence_id)
    sentences <- sapply(sentences, FUN=function(x) paste(x$token, collapse = " "))
    paste(sentences, collapse = " \t ")
  })
matthijsvanderloos commented 4 years ago

Yes, if I change it to:

x <- sapply(x, FUN = function(tokens) {
  sentences <- split(tokens, tokens$sentence_id)
  sentences <- sapply(sentences, FUN = function(x) paste(x$token, 
                                                         collapse = " "))
  paste(sentences, collapse = "\t")
})

and then call starspace manually, there's no empty term in the dictionary.

jwijffels commented 4 years ago

I'll have to dig into the Starspace C++ code to find this out. I maybe assumed wrongly that it should have been as indicated in the README of Starspace https://github.com/facebookresearch/StarSpace

roger federer loses <tab> venus williams wins <tab> world series ended
i love cats <tab> funny lolcat links <tab> how to be a petsitter  

But when looking at the example of https://github.com/facebookresearch/Starspace/blob/master/examples/wikipedia_article_search.sh (although that training dataset is no longer available by facebook, I have it locally), their training data looks as follows:

"Gino 's Hamburgers\tGino 's Hamburgers was a fast-food restaurant chain founded in Baltimore , Maryland , by Baltimore Colts defensive end Gino Marchetti and running back Alan Ameche , along with their close friend Louis Fischer , in 1957 .\tA new group of restaurants under the Gino 's name involving some of the principals of the original chain was started in 2010 .\tThe first Gino 's was opened in Dundalk , Maryland , just outside Baltimore ; it got its official name in 1959 when the owners invited on Colts ' captain Gino Marchetti to become a partner .\tIn 1967 Gino 's merged with Tops Drive Inn , a chain of 18 drive-in restaurants located in the Washington , D.C. , area ; most Tops locations were rebranded as Gino 's .\tIn the early 1970s , the company attempted to expand from its East Coast base into the midwest , however these locations only operated a short period .\tFor one location , it purchased Orchestra Hall in Detroit and planned to demolish the structure to construct a restaurant .\tWhen the plan became public , it led to a grass-roots campaign to save and restore the abandoned structure .\tThe chain had 359 company-owned locations when Marriott Corporation acquired it in 1982 .\tMarriott discontinued the brand and converted locations to its Roy Rogers Restaurants chain .\tThe last Gino 's , located in Pasadena , Maryland and owned independently from Marriott , closed in 1986 .\tGino 's also operated the Rustler Steak House chain , which was sold by Marriott shortly after its purchase of Gino 's .\tThe restaurant was known for high-quality hamburgers such as the Sirloiner , which was made from sirloin steak ( and was originally a staple of Tops Drive Inn ) , and the Gino Giant , which predated and later competed with the Big Mac .\tThe company held the franchise for Kentucky Fried Chicken in the Mid-Atlantic states .\tThe company 's jingle , played during radio advertisements in the early years was `` Everybody goes to Gino 's , ‘ cause Gino 's is the place to go ! ''\tThe original menu included french fries made fresh in each restaurant .\tOne of the last menu changes included burgers made from fresh ground beef each morning , unique in the fast food industry at the time .\tThe company also became known for its philanthropic efforts .\tThe executives of the company supported many educational , cultural , recreational , and athletic programs .\tThis made the community and the company tied with the common goal of trying to help the youth .\tMarchetti , Romano , and Fischer have opened several new Gino 's restaurants .\tMarchetti and Fischer will be serving as consultants .\tThe new restaurants plan to serve burgers , chicken sandwiches , hand-cut french fries and hand-spun milkshakes .\tInitially , the chain plans to open locations in Pennsylvania and Maryland .\tIn charge is Tom Romano , who worked for 20 years with the company , and was C.O.O .\tin 1982 when the chain was sold .\t`` It 's apparent there 's a need for better burgers out there '' , said Romano , citing the success of such chains as Five Guys , and Gino 's Burgers and Chicken has placed itself upscale of the earlier Gino 's .\tGino 's plans to make its burgers to order from fresh beef .\tTheir first location opened in King of Prussia , Pennsylvania , the same town as the original chain 's headquarters , on October 25 , 2010 .\tPlans were announced in Spring 2011 for franchise expansion into Baltimore .\tOn August 17 , 2011 , a second Gino 's location opened in Towson , Maryland .\tAnother Gino 's opened in Bensalem , Pennsylvania on October 11 , 2011 .\tA Gino 's Burgers and Chicken opened in Oriole Park at Camden Yards at the start of the Orioles season in 2012 , but closed by the end of the 2014 season .\tOn January 22 , 2013 , Gino 's Burgers and Chicken opened in Aberdeen , Maryland , however the Bensalem location closed around the same time .\tLater , on July 9 , 2013 , the King of Prussia location closed , effectively leaving the Philadelphia market .\tThe location at Perry Hall , Maryland , which opened on March 5 , 2012 , closed on December 8 , 2013 .\tThe Aberdeen location closed on March 27 , 2016 , leaving only the Towson and Glen Burnie locations ."

So it looks indeed like " \t " should be replaced by "\t"

jwijffels commented 4 years ago

I'm also using this " \t " in embed_docspace and embed_sentencespace so probably doublecheck C++ code there too.

jwijffels commented 4 years ago

But I need to doublecheck as it might also be that it needs "\t "

jwijffels commented 4 years ago

Note to myself: https://github.com/bnosac/ruimtehol/blob/da1cd4da6bbcc6a7cc4c346989e8476f37864091/src/Starspace/src/doc_parser.cpp#L79

jwijffels commented 4 years ago

Chain of commands:

run sp->init(): https://github.com/bnosac/ruimtehol/blob/da1cd4da6bbcc6a7cc4c346989e8476f37864091/src/rcpp_textspace.cpp#L319 run initParser: https://github.com/bnosac/ruimtehol/blob/da1cd4da6bbcc6a7cc4c346989e8476f37864091/src/Starspace/src/starspace.cpp#L79

if fileformat is of type fastText, parser is of type DataParser, if fileformat is of type labelDoc, parser is of type LayerDataParser type labelDoc is used in embed_sentencespace / embed_docspace / embed_articlespace

run dict->readFromFile(filename, parser); https://github.com/bnosac/ruimtehol/blob/da1cd4da6bbcc6a7cc4c346989e8476f37864091/src/Starspace/src/starspace.cpp#L82 this calls https://github.com/bnosac/ruimtehol/blob/da1cd4da6bbcc6a7cc4c346989e8476f37864091/src/Starspace/src/dict.cpp#L135 which runs over all lines and parses out the dictionary https://github.com/bnosac/ruimtehol/blob/da1cd4da6bbcc6a7cc4c346989e8476f37864091/src/Starspace/src/dict.cpp#L150 there the magic happens: https://github.com/bnosac/ruimtehol/blob/9faf7a54a610785a9c7ab8055359937b81769681/src/Starspace/src/parser.cpp#L50 but argument sep was not passed in the call there so it uses the default which is at https://github.com/bnosac/ruimtehol/blob/611fa5e001db743b44e6bf643cd99de36a170b4d/src/Starspace/src/parser.h#L62, which is "\t "

run trainData = initData(): https://github.com/bnosac/ruimtehol/blob/da1cd4da6bbcc6a7cc4c346989e8476f37864091/src/Starspace/src/starspace.cpp#L87 trainData is then of type InternDataHandler run trainData->loadFromFile(args->trainFile, parser); https://github.com/bnosac/ruimtehol/blob/da1cd4da6bbcc6a7cc4c346989e8476f37864091/src/Starspace/src/data.cpp#L40 then calls parser->parse https://github.com/bnosac/ruimtehol/blob/da1cd4da6bbcc6a7cc4c346989e8476f37864091/src/Starspace/src/data.cpp#L58 and eventually calls https://github.com/bnosac/ruimtehol/blob/9faf7a54a610785a9c7ab8055359937b81769681/src/Starspace/src/parser.h#L60 and https://github.com/bnosac/ruimtehol/blob/9faf7a54a610785a9c7ab8055359937b81769681/src/Starspace/src/parser.cpp#L38 again splitting on "\t "

jwijffels commented 4 years ago

But if we implement this in embed_articlespace, there is "" in the dictionary...

jwijffels commented 4 years ago

Ok found out after some debugging

embed_tagspace & embed_wordspace & embed_entityrelationspace & embed_pagespace & embed_imagespace (which use fastText format) basically parse data as follows

embed_sentencespace & embed_docspace & embed_articlespace (which use labelDoc format) basically parse data as follows

So conclusion to this, I should update " \t " with "\t" as a tab followed by a space would generate that "" in the dictionary.

Extra note. It is equally important that the R user remove double spaces by one space to avoid having "" in the dictionary.