SoftVarE-Group / MibTeX

Minimalistic tool to manage your references with BibTeX
GNU Lesser General Public License v3.0

Make sure that first article found with Google Scholar is sufficiently similar to the paper title #16

Closed tthuem closed 7 years ago

tthuem commented 8 years ago

I would recommend using the edit distance and checking that it is smaller than titleLength/5. Otherwise, store -2, as the paper is then essentially not found.
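
The check proposed here could look roughly like the following Python sketch. The function names and the `NOT_FOUND` constant are hypothetical and only for illustration. This is not the MibTeX implementation; it only assumes the rule stated above (edit distance below titleLength/5, otherwise -2).

```python
NOT_FOUND = -2  # sentinel value suggested in the comment above


def levenshtein(a: str, b: str) -> int:
    """Edit distance counting insertions, deletions and substitutions."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(previous[j] + 1,          # delete char_a
                               current[j - 1] + 1,       # insert char_b
                               previous[j - 1] + cost))  # substitute or keep
        previous = current
    return previous[-1]


def citations_or_not_found(bib_title: str, scholar_title: str, cited_by: int) -> int:
    """Keep the citation count only if the first hit is close to the BibTeX title."""
    limit = len(bib_title) / 5  # the titleLength/5 threshold
    if levenshtein(bib_title, scholar_title) < limit:
        return cited_by
    return NOT_FOUND
```

With this rule, a 48-character title tolerates up to 9 single-character edits before the entry is stored as -2.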

Kogoro commented 8 years ago

Do you have an example where the first result does not match, so that the implementation can be tested against that entry?

tthuem commented 8 years ago

@inproceedings{PTF+:SPLC16,
    author = {Pfofe, Tristan and Th\"um, Thomas and Fenske, Wolfram and Schulze, Sandro and Schaefer, Ina},
    title = {{Synchronizing Software Variants with VariantSync}},
    booktitle = SPLC,
    publisher = ACM,
    address = NY,
    year = 2016,
    note = {To appear}
}

Kogoro commented 8 years ago

It is now implemented with a pattern built at run time, so the title must be contained in the specific result element. On the other hand, this pattern makes the gathering fragile against HTML changes.

tthuem commented 8 years ago

Seems to work now.

tthuem commented 8 years ago

Almost all entries are now considered as not found on Google Scholar. Many articles that were already found are now updated to -2, which does not make sense.

Kogoro commented 8 years ago

Currently I have this pattern: /<div class="gs_r">.*The Coq Proof Assistant Reference Manual.*Cited by (\d*).*<\/div><\/div>.*/igu

However, I identified some problems with the new pattern at run time:

  1. The title is not the same, so the pattern does not match
  2. The old citation counts are false positives, so many more entries will now get -2. This happens often with entries that aren't papers.
  3. For entries like "The Coq Proof Assistant" the service finds multiple matches and only takes the first
  4. Another extreme case is "The Coq Proof Assistant Reference Manual" where multiple versions exist

Problems 1 and 2 could only be solved by removing the title from the pattern again, but that would increase the false positives again. For 3 and 4 we could calculate an average or take only the largest citation count. What do you think?
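
The two options for problems 3 and 4 can be sketched in Python as follows. The HTML snippet, the function names and the citation numbers are made up for illustration. The real Google Scholar markup is not reproduced here, and only the `gs_r` class and the "Cited by" text are taken from the pattern above.

```python
import re
from statistics import mean

# Hypothetical result markup: two versions of the manual and one unrelated hit.
SAMPLE = (
    '<div class="gs_r"><div>The Coq Proof Assistant Reference Manual, v8.4 '
    '<a>Cited by 120</a></div></div>'
    '<div class="gs_r"><div>The Coq proof assistant reference manual: version 6.1 '
    '<a>Cited by 30</a></div></div>'
    '<div class="gs_r"><div>Unrelated paper <a>Cited by 999</a></div></div>'
)


def cited_by_counts(html, title):
    """Collect the "Cited by" number of every result block containing the title."""
    block_pattern = re.compile(r'<div class="gs_r">(.*?)</div></div>',
                               re.IGNORECASE | re.DOTALL)
    # re.escape keeps characters such as "+" or "(" in a title from acting as regex syntax.
    title_pattern = re.compile(re.escape(title), re.IGNORECASE)
    counts = []
    for block in block_pattern.findall(html):
        if title_pattern.search(block):
            match = re.search(r"Cited by (\d+)", block)
            if match:
                counts.append(int(match.group(1)))
    return counts


def aggregate(counts, strategy="max"):
    """Combine several matches: largest count, or rounded average; -2 if none."""
    if not counts:
        return -2
    return max(counts) if strategy == "max" else round(mean(counts))


counts = cited_by_counts(SAMPLE, "The Coq Proof Assistant Reference Manual")
print(aggregate(counts), aggregate(counts, "mean"))
```

Matching each result block separately, rather than with one greedy `.*` over the whole page, keeps the count of one result from being attributed to another.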

tthuem commented 8 years ago

Please revert the changes for now, as it worked better before.

Then I looked into the wrong cases, and it seems to be a matter of different capitalization. As I said, this only makes sense if we rely on edit distance and do not compare exactly.

Examples:

"SDTF:TOPLAS13";"Contracts for First-Class Classes";20;1462821710040;
"H:FMCO13";"The Abstract Behavioral Specification Language A Tutorial Introduction";11;1462822911497;
"AKS+:FOSD13";"Exploring Feature Interactions in the Wild The New Feature-interaction Challenge";12;1462824113296;
"BRN+:VaMoS13";"A Survey of Variability Modeling in Industrial Practice";117;1462825315393;
"SRA:GPCE13";"Family-Based Performance Measurement";10;1462826517458;
"GBC+:ESECFSE13";"Incrementally Synthesizing Controllers from Scenario-Based Product Line Specifications";14;1462827718995;
"GSCH:REJ13";"Features Meet Scenarios Modeling and Consistency-Checking Scenario-Based Product Line Specifications";8;1462828920680;
"ABT+:REJ13";"Evaluating Scenario-Based SPL Requirements Approaches The Case for Modularity, Stability and Expressiveness";6;1462830122547;
"HS:SC13";"Reusable Components for Lightweight Mechanisation of Programming Languages";2;1462831324311;
"MRG:GPCE13";"Investigating Preprocessor-Based Syntax Errors";19;1462832525507;
"MRKN:iFM13";"Compositional Verification of Software Product Lines";17;1462833727130;
"BDS13";"Compositional Type Checking of Delta-Oriented Software Product Lines";23;1462834928447;

results in

"SDTF:TOPLAS13";"Contracts for First-Class Classes";-2;1466613906166;
"H:FMCO13";"The Abstract Behavioral Specification Language A Tutorial Introduction";-2;1466615136092;
"AKS+:FOSD13";"Exploring Feature Interactions in the Wild The New Feature-interaction Challenge";-2;1466616364571;
"BRN+:VaMoS13";"A Survey of Variability Modeling in Industrial Practice";-2;1466617592389;
"SRA:GPCE13";"Family-Based Performance Measurement";-2;1466618821484;
"GBC+:ESECFSE13";"Incrementally Synthesizing Controllers from Scenario-Based Product Line Specifications";-2;1466620050393;
"GSCH:REJ13";"Features Meet Scenarios Modeling and Consistency-Checking Scenario-Based Product Line Specifications";-2;1466621279027;
"ABT+:REJ13";"Evaluating Scenario-Based SPL Requirements Approaches The Case for Modularity, Stability and Expressiveness";-2;1466622512765;
"HS:SC13";"Reusable Components for Lightweight Mechanisation of Programming Languages";-2;1466623739889;
"MRG:GPCE13";"Investigating Preprocessor-Based Syntax Errors";-2;1466624967417;
"MRKN:iFM13";"Compositional Verification of Software Product Lines";-2;1466626194573;
"BDS13";"Compositional Type Checking of Delta-Oriented Software Product Lines";-2;1466627422731;
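
Why exact comparison fails on such entries can be illustrated with a small Python sketch. The `normalize_title` helper is hypothetical, and the Scholar-style spelling of the title below is an assumption for illustration, not scraped data.

```python
import re


def normalize_title(title: str) -> str:
    """Case-fold the title and collapse punctuation and whitespace to single spaces."""
    folded = title.casefold()
    return re.sub(r"[\W_]+", " ", folded).strip()


bib = "Exploring Feature Interactions in the Wild The New Feature-interaction Challenge"
# Assumed spelling on the result page (sentence case, colon kept):
scholar = "Exploring feature interactions in the wild: the new feature-interaction challenge"

print(bib == scholar)                                    # exact comparison
print(normalize_title(bib) == normalize_title(scholar))  # comparison after normalization
```

Normalization like this only removes capitalization and punctuation differences. Titles that differ in actual words still need the edit distance.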

Kogoro commented 8 years ago

I uploaded 1b2bbd50ce4fa6fa2189d5d01dd795adf7382b39 as a fix for this issue. I have implemented a search over all results, and the best fit according to the Levenshtein distance is taken. After a run of more than 3 hours, there were no more -2 entries or other errors. The results are also similar to or better than with the cited search only.
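
A best-fit search of this kind might look like the following Python sketch. It is not the code from the commit. The function names, the case-folding step and the example counts are assumptions for illustration.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance counting insertions, deletions and substitutions."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            current.append(min(previous[j] + 1,
                               current[j - 1] + 1,
                               previous[j - 1] + (char_a != char_b)))
        previous = current
    return previous[-1]


def best_match(bib_title, results):
    """Return the (title, cited_by) pair closest to bib_title, or None if there are no results.

    Titles are case-folded first so that capitalization does not count as an edit.
    """
    if not results:
        return None
    wanted = bib_title.casefold()
    return min(results, key=lambda result: levenshtein(wanted, result[0].casefold()))


# Made-up result list; the counts are placeholders, not real citation numbers.
results = [
    ("Structure and Interpretation of Classical Mechanics", 300),
    ("Structure and interpretation of computer programs", 5000),
]
print(best_match("Structure and Interpretation of Computer Programs", results))
```

Taking the minimum over all results instead of only the first one avoids depending on the order in which the results are listed.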

tthuem commented 8 years ago

Looks better, but still wrong results for

Advanced Compiler Design and Implementation
Structure and Interpretation of Computer Programs
Proof-Carrying Code

The last is a tough one, as the title is actually wrong in Google Scholar. I can live with the last one, but what is wrong with the other two?

tthuem commented 8 years ago

This was closed automatically, but the problem is still there.

Kogoro commented 8 years ago

I improved the search with another regular expression that filters the entries first, before trying to match the citations. Please check whether it works for you. You can also change the levenshteinParameter. In my tests with the above-mentioned entries and with the normal list, it works pretty well with 10% word changes allowed.

tthuem commented 7 years ago

Seems to work fine.