Closed matthijsvanderloos closed 4 years ago
Do you mean you changed the code in embed_articlespace which now has
x <- sapply(x, FUN=function(tokens){
sentences <- split(tokens, tokens$sentence_id)
sentences <- sapply(sentences, FUN=function(x) paste(x$token, collapse = " "))
paste(sentences, collapse = " \t ")
})
Yes, if I change it to:
x <- sapply(x, FUN = function(tokens) {
sentences <- split(tokens, tokens$sentence_id)
sentences <- sapply(sentences, FUN = function(x) paste(x$token,
collapse = " "))
paste(sentences, collapse = "\t")
})
and then call starspace
manually, there's no empty term in the dictionary.
I'll have to dig into the Starspace C++ code to find this out. I maybe assumed wrongly that it should have been as indicated in the README of Starspace https://github.com/facebookresearch/StarSpace
roger federer loses <tab> venus williams wins <tab> world series ended
i love cats <tab> funny lolcat links <tab> how to be a petsitter
But when looking at the example of https://github.com/facebookresearch/Starspace/blob/master/examples/wikipedia_article_search.sh (although that training dataset is no longer available by facebook, I have it locally), their training data looks as follows:
"Gino 's Hamburgers\tGino 's Hamburgers was a fast-food restaurant chain founded in Baltimore , Maryland , by Baltimore Colts defensive end Gino Marchetti and running back Alan Ameche , along with their close friend Louis Fischer , in 1957 .\tA new group of restaurants under the Gino 's name involving some of the principals of the original chain was started in 2010 .\tThe first Gino 's was opened in Dundalk , Maryland , just outside Baltimore ; it got its official name in 1959 when the owners invited on Colts ' captain Gino Marchetti to become a partner .\tIn 1967 Gino 's merged with Tops Drive Inn , a chain of 18 drive-in restaurants located in the Washington , D.C. , area ; most Tops locations were rebranded as Gino 's .\tIn the early 1970s , the company attempted to expand from its East Coast base into the midwest , however these locations only operated a short period .\tFor one location , it purchased Orchestra Hall in Detroit and planned to demolish the structure to construct a restaurant .\tWhen the plan became public , it led to a grass-roots campaign to save and restore the abandoned structure .\tThe chain had 359 company-owned locations when Marriott Corporation acquired it in 1982 .\tMarriott discontinued the brand and converted locations to its Roy Rogers Restaurants chain .\tThe last Gino 's , located in Pasadena , Maryland and owned independently from Marriott , closed in 1986 .\tGino 's also operated the Rustler Steak House chain , which was sold by Marriott shortly after its purchase of Gino 's .\tThe restaurant was known for high-quality hamburgers such as the Sirloiner , which was made from sirloin steak ( and was originally a staple of Tops Drive Inn ) , and the Gino Giant , which predated and later competed with the Big Mac .\tThe company held the franchise for Kentucky Fried Chicken in the Mid-Atlantic states .\tThe company 's jingle , played during radio advertisements in the early years was `` Everybody goes to Gino 's , ‘ cause Gino 's is the place to go ! ''\tThe original menu included french fries made fresh in each restaurant .\tOne of the last menu changes included burgers made from fresh ground beef each morning , unique in the fast food industry at the time .\tThe company also became known for its philanthropic efforts .\tThe executives of the company supported many educational , cultural , recreational , and athletic programs .\tThis made the community and the company tied with the common goal of trying to help the youth .\tMarchetti , Romano , and Fischer have opened several new Gino 's restaurants .\tMarchetti and Fischer will be serving as consultants .\tThe new restaurants plan to serve burgers , chicken sandwiches , hand-cut french fries and hand-spun milkshakes .\tInitially , the chain plans to open locations in Pennsylvania and Maryland .\tIn charge is Tom Romano , who worked for 20 years with the company , and was C.O.O .\tin 1982 when the chain was sold .\t`` It 's apparent there 's a need for better burgers out there '' , said Romano , citing the success of such chains as Five Guys , and Gino 's Burgers and Chicken has placed itself upscale of the earlier Gino 's .\tGino 's plans to make its burgers to order from fresh beef .\tTheir first location opened in King of Prussia , Pennsylvania , the same town as the original chain 's headquarters , on October 25 , 2010 .\tPlans were announced in Spring 2011 for franchise expansion into Baltimore .\tOn August 17 , 2011 , a second Gino 's location opened in Towson , Maryland .\tAnother Gino 's opened in Bensalem , Pennsylvania on October 11 , 2011 .\tA Gino 's Burgers and Chicken opened in Oriole Park at Camden Yards at the start of the Orioles season in 2012 , but closed by the end of the 2014 season .\tOn January 22 , 2013 , Gino 's Burgers and Chicken opened in Aberdeen , Maryland , however the Bensalem location closed around the same time .\tLater , on July 9 , 2013 , the King of Prussia location closed , effectively leaving the Philadelphia market .\tThe location at Perry Hall , Maryland , which opened on March 5 , 2012 , closed on December 8 , 2013 .\tThe Aberdeen location closed on March 27 , 2016 , leaving only the Towson and Glen Burnie locations ."
So it looks indeed like " \t " should be replaced by "\t"
I'm also using this " \t " in embed_docspace and embed_sentencespace so probably doublecheck C++ code there too.
But I need to doublecheck as it might also be that it needs "\t "
Chain of commands:
run sp->init(): https://github.com/bnosac/ruimtehol/blob/da1cd4da6bbcc6a7cc4c346989e8476f37864091/src/rcpp_textspace.cpp#L319 run initParser: https://github.com/bnosac/ruimtehol/blob/da1cd4da6bbcc6a7cc4c346989e8476f37864091/src/Starspace/src/starspace.cpp#L79
if fileformat is of type fastText, parser is of type DataParser, if fileformat is of type labelDoc, parser is of type LayerDataParser type labelDoc is used in embed_sentencespace / embed_docspace / embed_articlespace
run dict->readFromFile(filename, parser); https://github.com/bnosac/ruimtehol/blob/da1cd4da6bbcc6a7cc4c346989e8476f37864091/src/Starspace/src/starspace.cpp#L82 this calls https://github.com/bnosac/ruimtehol/blob/da1cd4da6bbcc6a7cc4c346989e8476f37864091/src/Starspace/src/dict.cpp#L135 which runs over all lines and parses out the dictionary https://github.com/bnosac/ruimtehol/blob/da1cd4da6bbcc6a7cc4c346989e8476f37864091/src/Starspace/src/dict.cpp#L150 there the magic happens: https://github.com/bnosac/ruimtehol/blob/9faf7a54a610785a9c7ab8055359937b81769681/src/Starspace/src/parser.cpp#L50 but argument sep was not passed in the call there so it uses the default which is at https://github.com/bnosac/ruimtehol/blob/611fa5e001db743b44e6bf643cd99de36a170b4d/src/Starspace/src/parser.h#L62, which is "\t "
run trainData = initData(): https://github.com/bnosac/ruimtehol/blob/da1cd4da6bbcc6a7cc4c346989e8476f37864091/src/Starspace/src/starspace.cpp#L87 trainData is then of type InternDataHandler run trainData->loadFromFile(args->trainFile, parser); https://github.com/bnosac/ruimtehol/blob/da1cd4da6bbcc6a7cc4c346989e8476f37864091/src/Starspace/src/data.cpp#L40 then calls parser->parse https://github.com/bnosac/ruimtehol/blob/da1cd4da6bbcc6a7cc4c346989e8476f37864091/src/Starspace/src/data.cpp#L58 and eventually calls https://github.com/bnosac/ruimtehol/blob/9faf7a54a610785a9c7ab8055359937b81769681/src/Starspace/src/parser.h#L60 and https://github.com/bnosac/ruimtehol/blob/9faf7a54a610785a9c7ab8055359937b81769681/src/Starspace/src/parser.cpp#L38 again splitting on "\t "
But if we implement this in embed_articlespace, there is "" in the dictionary...
Ok found out after some debugging
embed_tagspace & embed_wordspace & embed_entityrelationspace & embed_pagespace & embed_imagespace (which use fastText format) basically parse data as follows
"\t "
, meaning splits by tab OR space as it uses boost::is_any_of"\t"
followed by parser at word level which uses " "
embed_sentencespace & embed_docspace & embed_articlespace (which use labelDoc format) basically parse data as follows
"\t "
, meaning splits by tab OR space as it uses boost::is_any_of"\t"
followed by parser at word level which uses " "
So conclusion to this, I should update " \t " with "\t" as a tab followed by a space would generate that "" in the dictionary.
Extra note. It is equally important that the R user remove double spaces by one space to avoid having "" in the dictionary.
Hi Jan,
I've been using ruimtehol for a while now, specifically the articlespace embedding functionality. However, I recently noticed that the
embed_articlespace
function produces an empty term in the dictionary of the StarSpace model.Reproducible example:
When I change the sentence separator to
\t
(instead of<space>\t<space>
), the empty term is not in the dictionary anymore.Hence, should the sentence separator for the labelDoc format be surrounded with spaces or not?