MaartenGr / BERTopic

Leveraging BERT and c-TF-IDF to create easily interpretable topics.
https://maartengr.github.io/BERTopic/
MIT License

Finetune on arxiv dataset #27

Closed: Karol-G closed this issue 3 years ago

Karol-G commented 3 years ago

Hi,

Thanks for your amazing work! However, I am still having some problems getting good results. I want to use BERTopic on the Kaggle arXiv abstract dataset (https://www.kaggle.com/Cornell-University/arxiv), which contains the abstract of every paper on arXiv: 1,796,908 abstracts in total, of which I am using only a quarter (449,227 abstracts) due to hardware constraints. The raw data is a list of dicts, with each dict containing fields such as author, title, and abstract, but I am only using the abstracts themselves.
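I extract the abstracts roughly like this (a sketch; the snapshot file name on the Kaggle page may differ):

```python
import json

# The Kaggle dump is JSON lines: one paper per line, each a dict with
# fields such as "authors", "title", and "abstract".
abstracts = []
with open("arxiv-metadata-oai-snapshot.json") as f:
    for line in f:
        abstracts.append(json.loads(line)["abstract"])
```

My current results are sadly not what I expected. Here is the output of model.get_topics():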

##################################
[('withdrawn', 0.12732245199899253), ('arxiv', 0.060818479638394804), ('author', 0.045282397053936205), ('been', 0.043582983757148634), ('paper', 0.04331377340066525), ('authors', 0.03602908119595011), ('has', 0.03413129955351502), ('discussion', 0.020046558277271205), ('version', 0.017570171724893863), ('error', 0.016245558058635576), ('due', 0.016034569088373845), ('4002', 0.015203603208166275), ('article', 0.015178787241213468), ('mcshane', 0.014512825764984364), ('1104', 0.013893798421411663), ('crucial', 0.012724570309587551), ('wyner', 0.011639183974558176), ('proxies', 0.011545341114998098), ('please', 0.011257392365683372), ('0804', 0.010829445454730597)]
##################################
[('withdrawn', 1.2378088374383105), ('been', 0.33161619452791685), ('paper', 0.2815599045047751), ('has', 0.2598521946696877), ('administratively', 0.037473809176819514), ('article', 0.035331755008345955), ('retracted', 0.032019088876856915), ('abstract', 0.03194552517951581), ('withdraw', 0.03023297555105207), ('submission', 0.025366426781504862), ('mistake', 0.024461766176310584), ('rewriting', 0.02108034213126981), ('want', 0.019899921380598113), ('this', 0.018769808691909386), ('shorter', 0.01690150874372038), ('comment', 0.01634104139337519), ('probably', 0.016086126678481083), ('applicable', 0.015210457362549933), ('modification', 0.014865572450063681), ('longer', 0.014582146616905768)]
##################################
[('isotopes', 0.2558417644790476), ('thirty', 0.22683223596981578), ('refereed', 0.19469496374394987), ('publication', 0.1454113600287234), ('isotope', 0.14061126024476908), ('brief', 0.11791781235255983), ('identification', 0.10522952641115933), ('discovery', 0.09816511283302375), ('summary', 0.08730514775501705), ('synopsis', 0.07636960227187568), ('production', 0.07302537404971297), ('discussed', 0.06821321035793458), ('including', 0.06506837933191251), ('presented', 0.06352529575007038), ('twenty', 0.057115384937416365), ('eight', 0.05686672933874315), ('each', 0.054793906411334324), ('far', 0.05099118599417039), ('minerals', 0.04545668048089448), ('observed', 0.04482361175054437)]
##################################
[('withdrawn', 0.8102220016577751), ('author', 0.5882810125654714), ('been', 0.21955498849750296), ('paper', 0.1935837075655028), ('has', 0.17516982841654938), ('pourmohammad', 0.08015698196117896), ('ali', 0.0605915947645577), ('seemann', 0.027628108579230422), ('eqn', 0.02270159607399226), ('admin', 0.022251661779234076), ('request', 0.01530721857357427), ('by', 0.013408498518361402), ('this', 0.012868228621997208), ('modification', 0.010599219818655269), ('authors', 0.010375596329419137), ('arxiv', 0.010147235836885003), ('km', 0.008868127574553979), ('due', 0.0053505630213037635), ('first', 0.004174426091124793), ('at', 0.0013290983743134937)]
##################################
[('de', 0.16471413053677894), ('la', 0.08859824729535025), ('un', 0.07960098252808794), ('en', 0.07656758724369946), ('des', 0.07493017494049987), ('une', 0.0685045487905329), ('est', 0.06506619811186878), ('nous', 0.0552461357202294), ('que', 0.0505970341092853), ('dans', 0.04833955653453861), ('pour', 0.04773108415278024), ('les', 0.04405246081293785), ('et', 0.04269515259835229), ('sur', 0.0425329786858753), ('caract', 0.034373522683066204), ('le', 0.03028301508609669), ('es', 0.029084319074609982), ('ees', 0.028840836535619835), ('cette', 0.023815804070613532), ('eme', 0.023083284220080345)]
##################################
[('model', 0.0029213211816859273), ('two', 0.0029181910122442487), ('it', 0.002917764978256985), ('can', 0.002911863114896897), ('these', 0.002900525114119986), ('our', 0.0028719993646575373), ('show', 0.0028703487897916566), ('results', 0.002862179058792491), ('also', 0.0028543448650093332), ('field', 0.002807120623162036), ('have', 0.0027961151449595436), ('using', 0.002780531524136966), ('between', 0.0027687481202621554), ('or', 0.002762760512175864), ('one', 0.0027467154286522437), ('time', 0.002741766841704294), ('energy', 0.0027274038973420667), ('data', 0.0026880146639130568), ('quantum', 0.0026615769324125527), ('such', 0.002660012066337066)]
##################################
[('withdrawn', 0.25382672910326465), ('arxiv', 0.1859321307804051), ('author', 0.10424878083447243), ('been', 0.09053395420662566), ('paper', 0.0810609839954744), ('has', 0.06435182608618999), ('version', 0.05760067485755415), ('authors', 0.05258540866955608), ('superseded', 0.04795847515378754), ('replaced', 0.043281616461112844), ('merged', 0.03743469139047417), ('0804', 0.03698167270671404), ('1008', 0.03085884218486115), ('because', 0.030835129343355642), ('0812', 0.023350724849799137), ('0901', 0.022639659892192746), ('revised', 0.02187828357506638), ('1306v6', 0.021542196549539806), ('submission', 0.02115347128457661), ('3484', 0.020833341465434623)]
##################################
[('withdrawn', 0.3174814989979916), ('author', 0.11776777433239484), ('been', 0.08872075679275165), ('paper', 0.08199984676632026), ('due', 0.08048541635175363), ('has', 0.07026260094965996), ('error', 0.05649266661391457), ('authors', 0.05390187230801506), ('arxiv', 0.051928726372487306), ('because', 0.034956486686744344), ('mistake', 0.032456919238108894), ('crucial', 0.0322646808282665), ('submission', 0.029450103971990518), ('administrators', 0.02776402639935557), ('admin', 0.024154968037124056), ('proof', 0.02232748069916814), ('errors', 0.017136278392237612), ('lemma', 0.015641869372412024), ('copyright', 0.015397080186955915), ('theorem', 0.014757663158028029)]

As you can see, the extracted topics are rather bad and not what I had hoped for. Can you give me some advice on why this is not working and what I should finetune?

Best, Karol

MaartenGr commented 3 years ago

Hi Karol,

How many topics were created? If only a few, that usually leads to poor topic representations. I think it is worthwhile to decrease min_topic_size to 10, as that is likely to create more topics, which in turn allows better topic representations.

Also, BERTopic v0.4 just came out and contains significant improvements. I think it would be worthwhile to upgrade if you haven't done so already! You can find more extensive tutorials for BERTopic in the updated documentation at https://maartengr.github.io/BERTopic/.
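A minimal sketch of both suggestions, assuming the v0.4 API discussed in this thread (abstracts is the list of documents from above; exact return values may differ by version):

```python
from bertopic import BERTopic

# Lowering min_topic_size lets smaller clusters survive as topics,
# so more (and more specific) topics can be created.
model = BERTopic(min_topic_size=10)
topics, probs = model.fit_transform(abstracts)

print(model.get_topics())
```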

Karol-G commented 3 years ago

Hi,

Thanks for the quick reply! I will update to the new version and decrease min_topic_size to 10.

"I think it is worthwhile to decrease min_topic_size to 10 as it is likely to create more topics"

But why would it create more topics if I decrease the minimum number of topics? Wouldn't this result in fewer topics?

Best Karol

MaartenGr commented 3 years ago

Great, let me know if you get different results. With min_topic_size we do not decrease the minimum number of topics, but their minimum size. If you lower this value, smaller topics can be created, which in turn allows more topics overall. I had it set way too high in the previous version of BERTopic, which typically resulted in fewer than 10 topics, while in practice you often see more than 50 topics.
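For context, a simplified sketch of why this works: BERTopic hands this value to HDBSCAN as the minimum cluster size (embeddings here is a placeholder for the document embeddings):

```python
import hdbscan

# min_topic_size effectively acts as HDBSCAN's min_cluster_size: any
# candidate cluster (topic) with fewer documents is dissolved as noise,
# so a lower value permits smaller, and therefore more, topics.
clusterer = hdbscan.HDBSCAN(min_cluster_size=10, metric="euclidean")
labels = clusterer.fit_predict(embeddings)  # label -1 marks outlier documents
```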

Karol-G commented 3 years ago

Ah, I misread the parameter name the entire time >.< It makes sense now ;) Thanks again!

Karol-G commented 3 years ago

Hi again,

I finally had time to test your new version. The results are much better now, but this could also be because I forgot to sort them by frequency in my first post >.<

Here are the top 20 topics:

[('three', 0.013054170945547257), ('threedimensional', 0.006332323651834506), ('theory', 0.004608694671529709), ('threebody', 0.0038531703016460917), ('dimensions', 0.003651438206073678), ('scattering', 0.003253732764247434), ('quantum', 0.0032394830436024377), ('spin', 0.0031230529976747625), ('gravity', 0.003063954135957652), ('space', 0.0030414565055789785)]
[('graphene', 0.07436791552953284), ('electronic', 0.011277199258163029), ('bilayer', 0.010491573310948656), ('nanoribbons', 0.007713373553312529), ('graphite', 0.006762533556490666), ('layer', 0.006583893115193778), ('electron', 0.006568532342191205), ('carbon', 0.006369050510125981), ('magnetic', 0.0061219933246205605), ('layers', 0.005525049121293272)]
[('three', 0.013689297571895319), ('3manifold', 0.008894316411136385), ('3manifolds', 0.008129555355640364), ('hyperbolic', 0.007593168927445724), ('manifolds', 0.006493555788642359), ('manifold', 0.0063464379295909415), ('algebra', 0.00530003016046757), ('dimension', 0.00509895596685615), ('threefolds', 0.004905792710294163), ('algebras', 0.0048867687578040145)]
[('magnetic', 0.013151799166270458), ('superconducting', 0.011056026119834429), ('superconductivity', 0.010798289838756974), ('measurements', 0.006968743474354005), ('crystals', 0.006860577347932697), ('magnetization', 0.0068034465649972654), ('compounds', 0.006451027082081614), ('crystal', 0.006135389676944634), ('superconductors', 0.006120444728472715), ('structural', 0.0060716763341440326)]
[('graph', 0.05730184149686418), ('graphs', 0.045628145142319054), ('vertices', 0.026819255545078042), ('vertex', 0.018389062637356766), ('edges', 0.01502037677020415), ('subgraph', 0.007530116720545857), ('algorithm', 0.007406410476551838), ('coloring', 0.006973399728935682), ('connected', 0.006774455367355813), ('trees', 0.006569312817790809)]
[('string', 0.016074269579657928), ('gauge', 0.013872350267778905), ('four', 0.01172086323512176), ('theories', 0.009231287482594374), ('n4', 0.009124189000119019), ('dimensions', 0.008111317974220343), ('fourdimensional', 0.007788796978357288), ('supersymmetric', 0.0066570476967985565), ('4d', 0.005515783327101184), ('dimensional', 0.005240544660597792)]
[('regression', 0.015793994072434116), ('estimator', 0.012110553516710846), ('estimation', 0.010565802980851507), ('estimators', 0.009744969913795903), ('likelihood', 0.008671026207819644), ('distribution', 0.007834517888185543), ('inference', 0.007831399683834164), ('sampling', 0.00614168314645444), ('probability', 0.0056646979082951516), ('sample', 0.005618290983450382)]
[('condensate', 0.03563380736937094), ('boseeinstein', 0.03281346083163922), ('bose', 0.024604853303908294), ('condensates', 0.01590529015896842), ('condensation', 0.010063618540648899), ('atoms', 0.009020629389566267), ('solitons', 0.007574204763128568), ('gases', 0.00750277997557609), ('atomic', 0.005728053771570647), ('soliton', 0.005496361596549815)]
[('wireless', 0.043265644969453974), ('network', 0.026624172093601146), ('networks', 0.022602367756822363), ('nodes', 0.020314901727286453), ('routing', 0.012198552511664502), ('relay', 0.011443673627123568), ('transmission', 0.010778962688679476), ('protocols', 0.007792765016682918), ('coding', 0.007262368994436775), ('packet', 0.007088404364556387)]
[('algebra', 0.009508399679971644), ('spaces', 0.00874683359537929), ('algebras', 0.008621164547943608), ('finite', 0.00766028321891464), ('manifolds', 0.007345115657771375), ('finitely', 0.006817928828699886), ('manifold', 0.006794802606806686), ('metric', 0.005758919093004792), ('theorem', 0.005549398118729717), ('cohomology', 0.005329239199870784)]
[('financial', 0.02778448518234804), ('stock', 0.021005221083078732), ('asset', 0.014279006519891203), ('portfolio', 0.013997439694449176), ('pricing', 0.013211643593096296), ('trading', 0.013156962940594492), ('investment', 0.008495616180998393), ('stocks', 0.008361230495723947), ('assets', 0.008236565953627702), ('insurance', 0.005971728435419685)]
[('interference', 0.024713428079488675), ('receiver', 0.013857175281490372), ('transmitter', 0.01360526856123733), ('transmit', 0.012691929298860245), ('coding', 0.011635803821757558), ('transmission', 0.011566654683392375), ('antennas', 0.01123231845041445), ('broadcast', 0.008593780357944728), ('beamforming', 0.008552448819680583), ('receivers', 0.008459943014328328)]
[('hubbard', 0.053493199507069066), ('lattice', 0.010758592075718717), ('fermi', 0.0072053076967158935), ('antiferromagnetic', 0.007181871258321772), ('insulator', 0.007051933547436705), ('approximation', 0.006036515701808201), ('coupling', 0.00588917432860099), ('halffilling', 0.0057487305439479375), ('interactions', 0.005587492856648944), ('correlations', 0.0055763764430453375)]
[('function', 0.008651916818150886), ('denote', 0.007809245922609828), ('set', 0.007478684066622141), ('integer', 0.007374301037950408), ('integers', 0.0069542814232360205), ('omega', 0.006667642786285545), ('mathbb', 0.0060961646394778754), ('alpha', 0.006039583618754279), ('functions', 0.006017594883429485), ('bounded', 0.005814677932831883)]
[('inflation', 0.05931669348358714), ('inflationary', 0.02232483204321207), ('universe', 0.013689775008603551), ('perturbations', 0.012286552051758052), ('cosmological', 0.010828376490552719), ('slowroll', 0.009469809586834922), ('gravitational', 0.009136121565873076), ('gravity', 0.0071959489811766865), ('fluctuations', 0.006429827629092597), ('hubble', 0.005171651035312848)]
[('planets', 0.046814197426740575), ('planet', 0.03501459176392504), ('planetary', 0.018808237468975954), ('orbits', 0.011441766784746725), ('orbit', 0.008467096082013886), ('jupiter', 0.0077181769914923416), ('planetesimals', 0.0073672413745058335), ('asteroid', 0.0067335901530106036), ('exoplanets', 0.006022745742365477), ('asteroids', 0.00585370806168899)]
[('solar', 0.039421087365066554), ('sun', 0.011732411452529146), ('sunspot', 0.010430314535179447), ('photosphere', 0.009067243547239295), ('photospheric', 0.00865958912310833), ('convection', 0.008064198950269004), ('atmosphere', 0.0072483168826311725), ('observations', 0.006876844797981044), ('chromosphere', 0.006262490703920049), ('heating', 0.006055562210536541)]
[('observations', 0.008699188735815778), ('galaxies', 0.008407485953038762), ('galaxy', 0.008073430332696623), ('luminosity', 0.007162840940254354), ('stars', 0.006966175063807207), ('stellar', 0.005349763782737194), ('telescope', 0.005287464577458983), ('spectra', 0.005136408088761508), ('optical', 0.005035509490418128), ('ngc', 0.004829665052611631)]
[('neurons', 0.038221238895353046), ('brain', 0.02920640451360389), ('neural', 0.02446339583121359), ('neuronal', 0.01643279040940504), ('spike', 0.014812396510321582), ('neuron', 0.012965641652609569), ('cortex', 0.007331964185225165), ('spikes', 0.006943257568279404), ('cells', 0.00632549932973966), ('fmri', 0.005463485659527367)]
[('algebras', 0.029831387237084984), ('algebra', 0.024734690221066107), ('homotopy', 0.015433122770034542), ('cohomology', 0.015197981658463339), ('sheaves', 0.009816480887828655), ('algebraic', 0.009377566200273487), ('functors', 0.00822133504218289), ('complexes', 0.006714333078347921), ('theorem', 0.006619095606144724), ('groupoids', 0.006472502953402093)]

One thing I noticed is that the HDBSCAN clustering takes about a day with the abstracts (which is OK), but when taking only the titles (which are much shorter than the abstracts) the clustering takes more than a week (I aborted it, so it could be even longer). I tried this twice and both times it seemed to run forever. Do you have an idea why shorter texts take so much longer than longer ones?

Best, Karol

MaartenGr commented 3 years ago

Great! The topic representations definitely seem much better now. I am not entirely sure, but calculating the probabilities might be an inefficient step that increases the computation time. Have you tried setting calculate_probabilities to False? This could speed up the model significantly!
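A minimal sketch of that suggestion (parameter name as used in this thread; docs is a placeholder for your list of abstracts or titles):

```python
from bertopic import BERTopic

# Skipping the soft-clustering probabilities avoids an expensive
# per-document membership computation after clustering.
model = BERTopic(min_topic_size=10, calculate_probabilities=False)
topics, probs = model.fit_transform(docs)  # probs carries no soft-clustering info here
```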

Karol-G commented 3 years ago

I will set calculate_probabilities to False and see what happens. What is calculate_probabilities used for? Isn't it used for calculating the top N words describing a topic, or am I mistaken?

MaartenGr commented 3 years ago

No, it is used to get the soft-clustering output from HDBSCAN. It is likely that calculating the probabilities, especially when you have hundreds of topics and millions of documents, takes quite a while to finish. Setting it to False skips this step entirely, so computation will be faster.
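Roughly what that step computes, as a simplified sketch (embeddings is again a placeholder for the document embeddings):

```python
import hdbscan

# Soft clustering yields one membership vector per document, with an entry
# per topic. With hundreds of topics and hundreds of thousands of documents
# this is a large dense matrix, which is why skipping it saves so much time.
clusterer = hdbscan.HDBSCAN(min_cluster_size=10, prediction_data=True).fit(embeddings)
probabilities = hdbscan.all_points_membership_vectors(clusterer)
```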

Karol-G commented 3 years ago

Sorry for the late reply. It is really fast now, thanks!