Ten Years Challenge: [Rp] Typographical features for scene text recognition #35

Closed weinman closed 3 years ago

weinman commented 4 years ago

Original article: Weinman, J. J. (2010, August). Typographical features for scene text recognition. In 2010 20th International Conference on Pattern Recognition (pp. 3987-3990). IEEE. DOI:10.1109/ICPR.2010.970

PDF URL: http://www.cs.grinnell.edu/~weinman/tmp/rescience/rescience20_submitted.pdf
Metadata URL: http://www.cs.grinnell.edu/~weinman/tmp/rescience/metadata.yaml
Code URL:

Scientific domain: Pattern Recognition (Machine Learning)
Programming language: CUDA, Matlab, Java, C
Suggested editor: Thomas Arildsen, Lorena Barba, Georgios Detorakis

The code archive DOI is being processed. It may not appear for 48–72 hours.

Some of the LaTeX output/layout seems a bit wonky. If anyone would like to make behind-the-scenes suggestions, they may view the source at overleaf

rougier commented 4 years ago

@ThomasA Could you edit this submission for the Ten Years Reproducibility Challenge (only one reviewer needed)?

rougier commented 4 years ago

@weinman Thanks for your submission, we'll assign an editor soon.

rougier commented 4 years ago

@koustuvsinha @gdetor Can you edit this submission for the Ten Years Reproducibility Challenge (only 1 reviewer needed)?

gdetor commented 4 years ago

Hi @rougier I can handle this submission.

rougier commented 4 years ago

Oh great, thank you

gdetor commented 4 years ago

Hi @mlosch Could you please review this submission?

ThomasA commented 4 years ago

@rougier sorry I was not quite "awake" at the moment. I was quite busy lately and I am afraid a lot of ReScience communication hid in heaps of GitHub threads that just kept piling up - most of it papers that I was not involved in.

gdetor commented 4 years ago

@mlosch Gentle reminder.

rougier commented 4 years ago

@gdetor If @mlosch is not available, I can review.

gdetor commented 4 years ago

@rougier thank you.

rougier commented 4 years ago

I've started, and I will try to finish my review by this Friday.

rougier commented 4 years ago

Overall

This is quite a fascinating work with a really complex setup and a lot of software/hardware dependencies. When you read the original article (a conference paper with a page limit, I imagine), you have no idea of the whole machinery behind the results. This resonates strongly with Claerbout's famous quote: "an article about a computational result is advertising, not scholarship. The actual scholarship is the full software environment, code and data, that produced the result." Here, the author gives us really precise and insightful explanations of the whole machinery. It seems almost magical that he managed to reproduce it, but the "trick" is the use of an experimental data repository set up in 2010. This reproduction 10 years later seems to indicate it's quite a good structure for reproducible research. I did not try to re-run the software because you obviously need a CUDA environment (which I don't currently have) and the whole pipeline produces several gigabytes of data.

I have only one "major" comment/suggestion: in the introduction, the author dives directly into the details of the experimental data repository without first giving an (even small) overview/context of the original results. That might be a good strategy, because I stopped reading at this point to read the original article and later came back to this one, but I think it would be good to give some information on the original article (even if all the details are given later).

I've also a few minor comments/suggestions (see below) that would need to be addressed but I think the article is already in good shape.

Minor comments

rougier commented 4 years ago

Forgot to notify @gdetor and @weinman

gdetor commented 4 years ago

@rougier Thank you for your review.

gdetor commented 4 years ago

@weinman Gentle reminder

rougier commented 4 years ago

@gdetor @weinman Any progress?

weinman commented 4 years ago

Thank you @rougier for the helpful review. I was sidelined last month by a natural disaster, but I'm now picking up this thread.

Major suggestion

I'm very happy to insert a new section after the introduction giving some additional background and context for the original work. I agree that makes a lot of sense! I've drafted 4–5 paragraphs along these lines.

Because the focus of the article (I had thought) would be on the reproduction task, I thought leaving the introduction (having that focus) intact might make sense. However, I am also amenable to repeating what is given in the abstract, if the reviewers find that a useful addition. Specifically:

The original 2010 paper demonstrated that character recognition performance could be improved on difficult problems of scene text recognition by leveraging font-specific correlations between character identity and width.

I will be fiddling with pagination and figure location, but the draft is already visible in the overleaf (linked in the original post).

Minor suggestions

Starting the reproduction

This is a good point! I will add to the article that the sources of my submitted works always include a pointer from tables or figures to the experimental collections that generated them. Thus, I always have a pointer to the leaf/leaves of the dependency tree (Fig. 1), from which I can work backwards.

There was no original README because the computational aspect of the work was not previously published. (Though there was always intent.)

Link to Source

The accompanying metadata YAML file does indeed insert the link in the footer on page 1 ("Code is available at ..."). Is there an additional location and/or preferred method for relaying this information?

Browsable source

It's certainly possible to include this. I had limited myself to linking to the permanent archive, because it's the only URL I trust to still be functional in, say, 10 years (unlike GitHub or my own institution-hosted personal web site). Unfortunately, my institution's digital archive does not really facilitate file-level browsing.

I welcome suggestions for other platforms that might blend both permanence and browsability.

Caveats in Conclusion

Agreed. Inserting a new penultimate sentence would function nicely as a transition between the "Although" and "This article indicates". The caveat(s) primarily stem from the fact that we don't have a complete docker-like archive of the host platform, which I am happy to recapitulate.

Abstract Citation

This is indeed one of the technical LaTeX issues I was wondering about. This is automatically generated, but perhaps I'm doing something wrong. The metadata.yaml contains the following:

# Information about the original article that has been replicated
replication:
 - cite: "Weinman, J. J. (2010, August). Typographical features for scene text recognition. In 2010 20th International Conference on Pattern Recognition (pp. 3987-3990). IEEE." # Full textual citation
 - bib:  \cite{weinman10typographical} # Bibtex key (if any) in your bibliography file
 - url:  https://www.cs.grinnell.edu/~weinman/pubs/weinman10typographical.pdf # URL to the PDF, try to link to a non-paywall version
 - doi:  10.1109/ICPR.2010.970 # Regular digital object identifier

And this produces the following items in metadata.tex

\def \replicationCITE{Weinman, J. J. (2010, August). Typographical features for scene text recognition. In 2010 20th International Conference on Pattern Recognition (pp. 3987-3990). IEEE.}
\def \replicationBIB{\cite{weinman10typographical}}
\def \replicationURL{https://www.cs.grinnell.edu/~weinman/pubs/weinman10typographical.pdf}
\def \replicationDOI{10.1109/ICPR.2010.970}

I do welcome assistance on getting this formatted as the editors intend! (I agree that a bare numerical citation is undesirable.)
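
For readers following along, the correspondence between the YAML fields and the generated macros can be sketched as below. This is only an illustration of the mapping visible in the two listings above, not the template's actual conversion script; the function name `replication_defs` and the list-of-dicts input are assumptions made for the example.

```python
# Hypothetical sketch: reproduce the \def lines of metadata.tex from the
# `replication` entries of metadata.yaml. Not the template's real script.

def replication_defs(entries):
    """Build one \\def line per field.

    `entries` is a list of single-key dicts, which is how the YAML list
    shown above would look once parsed (assumed input shape).
    """
    lines = []
    for entry in entries:
        for key, value in entry.items():
            # Field "cite" becomes macro \replicationCITE, and so on.
            macro = "\\replication" + key.upper()
            # An empty YAML field (None) yields an empty macro body.
            lines.append("\\def " + macro + "{" + str(value or "") + "}")
    return lines


replication = [
    {"cite": "Weinman, J. J. (2010, August). Typographical features for "
             "scene text recognition. In 2010 20th International Conference "
             "on Pattern Recognition (pp. 3987-3990). IEEE."},
    {"bib": "\\cite{weinman10typographical}"},
    {"url": "https://www.cs.grinnell.edu/~weinman/pubs/weinman10typographical.pdf"},
    {"doi": "10.1109/ICPR.2010.970"},
]

for line in replication_defs(replication):
    print(line)
```

The sketch makes the issue concrete: whatever is placed in the `bib` field is copied verbatim into `\replicationBIB`, so the citation command chosen there determines what appears beneath the abstract.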

Verbatim Texts

Good typographical eye! (Appropriate, considering the subject of the original paper.)

Indeed, I actually shrunk the name of the very long collection (eighth line of Section 2) so it didn't stick out in the margin too far. However, Makefile (which comes later) and all the rest are the appropriate default size in LaTeX; juxtaposed, these do look odd, but careful inspection of the x-height for Makefile reveals it matches the default/serif text.

Collections shorthand

Admittedly, I invented the shorthand pA, eC so that the graph (Fig. 1) would be interpretable, and then found it convenient to continue to use these shorthand keys throughout. The complete names I used in the file system would perhaps be overkill (e.g., experiments/text/ngrams/bigrams/tied_nums_intracase_L1_validation-20090708075734) for casual reference throughout the manuscript, which is why I tried to describe the relevant collections before naming them specifically with these keys. They can be cross-referenced in the graph (Fig. 1) or, for more details, in the table (Table 3).

I agree it's not ideal, but given the sheer number and variety of them, I do not know whether it is worth inventing a new taxonomy. Section 3.1 describes the contents as they are enumerated. Table 2 attempts to use the semantically meaningful prefix character (p, e, or r) to glob these along with the header description ("Parser", "+Training", "+Data").

A bit of "inside baseball", but another minor reason for this somewhat generic naming scheme (aside from keeping the graph tidy) is that the dependency graph structure and indeed the graphic itself are all automatically and reproducibly generated (by tracing the DEPs files, processing the hierarchy location, and generating a file that can then be used by GraphViz).
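
As a rough illustration of that last step, the sketch below turns a dependency mapping into GraphViz DOT text. It is not the author's tooling: the format of the DEPs files is not described in this thread, so the input is assumed to be an already-parsed dictionary, and the key `rB` and the edges shown are hypothetical.

```python
# Minimal sketch, assuming the DEPs files have already been parsed into
# a dict mapping each collection key to the keys it depends on.

def deps_to_dot(deps):
    """Return GraphViz DOT source for a collection dependency graph."""
    lines = ["digraph collections {"]
    # Collect every key, including those that appear only as dependencies.
    nodes = set(deps)
    for targets in deps.values():
        nodes.update(targets)
    for node in sorted(nodes):
        lines.append(f'    "{node}";')
    for source in sorted(deps):
        for target in sorted(deps[source]):
            # Edges point from a dependency to the collection that uses it.
            lines.append(f'    "{target}" -> "{source}";')
    lines.append("}")
    return "\n".join(lines)


# Hypothetical example: eC depends on pA; rB depends on both.
example = {"eC": ["pA"], "rB": ["eC", "pA"]}
print(deps_to_dot(example))
```

The printed text can be saved to a file and rendered with the GraphViz `dot` command. Sorting the nodes and edges keeps the output deterministic, which matters if the figure is meant to be regenerated reproducibly.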

Conclusion

I will post a revision in the next week or so, but I do invite responses to any of the comments I've made above that may help smooth the process going forward. Thank you again for hosting the challenge, giving impetus for this evaluation, and providing the very helpful review.

rougier commented 4 years ago

@weinman Thanks for your very detailed report. I'm fully satisfied with the proposed corrections. To answer your question about the code, you can actually expand it on GitHub (or GitLab) and use Software Heritage (https://www.softwareheritage.org/save-and-reference-research-software/) to save the repository. Even if GitHub disappears sometime in the future, your code will be safe (Software Heritage is a non-profit foundation). You should obtain an SWH ID that you can put in the metadata.

For collection shortnames, I agree the long names would be tedious to use, and I can live with the short version.

For the abstract, I think you can use \textcite instead of \cite, but I'm not sure if the Overleaf template is up to date. Maybe you can upload the up-to-date template from https://github.com/ReScience/template.

@gdetor We're now waiting for the final version but I formally accept the submission.

gdetor commented 4 years ago

@rougier Thank you for the review. @weinman Congratulations. Once you upload the final version I'll proceed with publishing the paper.

weinman commented 4 years ago

@rougier Thank you for the feedback.

Regarding the "A replication of [1]" beneath the abstract: I've updated the template to the latest version and changed metadata.yaml/metadata.tex to use \textcite, but all this does is change it to "A replication of Weinman [1]".

Any further pointers or suggestions on how to make the output match the editors' desired format are very welcome!

rougier commented 4 years ago

Can you try \fullcite instead?

weinman commented 4 years ago

Yes, \fullcite produces a complete citation.

The metadata.yaml is peculiar on this point:

# Information about the original article that has been replicated
replication:
 - cite: # Full textual citation
 - bib:  # Bibtex key (if any) in your bibliography file
 - url:  # URL to the PDF, try to link to a non-paywall version
 - doi:  # Regular digital object identifier

My initial interpretation would be that bib is simply the bibtex key (i.e., it's weinman10typographical in my bibtex file). And cite would be the manually extracted text (I'd copied mine from what appeared in the ultimate reference list).

However, in order to produce what you've requested, the bib field above needs to be \fullcite{weinman10typographical}.

Maybe I'm misunderstanding something. However, if you too think that's peculiar, let me know if you'd like me to open an issue about it on the template.

rougier commented 4 years ago

Now that you've pointed it out, I'm not sure what I meant when I created the template, and it might be worth opening an issue. I think this might be related to the old template we used some years ago.

weinman commented 4 years ago

Thanks. I have some other formatting issues/questions.

SWH

I've entered everything into SWH and gotten an identifier for it. I've entered the value into metadata.yaml, which generates the appropriate \def \codeSWH line in metadata.tex. However, \codeSWH does not seem to be used anywhere by the template. (I'm looking particularly at the footer, where it says "Code is available at" and gives the DOI, which I'd certainly like to keep in there.)

Am I to manually cite it within the body of the paper? (Fine if so, just wanted to double-check!) Or is there something else I am to do?

Paragraphs

The paragraph formatting of the paper seems a bit odd, there is neither an extra space between lines nor an indentation for new paragraphs.

This behavior was observable in the submitted PDF as well. Is it expected? It seems unusual to me, as typically there is either one or the other (extra space or indentation). I just wanted to verify.

Header "Replication"

The header says "A replication of" and I just wanted to be sure this is right, since the prefix in the title is [Rp] as opposed to [Re]. I'm not sure if replication applies to both or if there is another word that should appear there (i.e., "reproduction"?).

Sorry for so many questions!

rougier commented 4 years ago

The \codeSWH is supposed to be used in header.tex. Maybe you need to update the template you're using. If you started from Overleaf, that template might be outdated and need updating.

For the missing space, I also noticed that and forgot to fix it. It's only a matter of adding \usepackage{parskip} to the template (can you make a PR?).

The header should say "Reproduction" in your case. You have to change the type in the metadata (where this specific option is missing in the comment).

And many thanks for all your comments, your expert eye will help us to improve the template.

weinman commented 4 years ago

Thanks @rougier !

I updated all the files (it turns out that before I'd only done a partial update of just rescience.cls) in the template. This solved some problems but introduced others.

That last point specifically is what I was asking about. The title/header indeed appears to be correct, it's the note under the abstract I wondered about. Should I make some change or is this the text expected?

I'd be happy to submit one PR that adds the package and uncomments the needed header elements (unless they're intended to be commented?).

rougier commented 4 years ago

That would be great. For the abstract and replication lines, I think I commented them out because we can now have editorials and letters that have neither an abstract nor a replication reference. The abstract inclusion should be conditional, and the same goes for replication/reproduction. If we have a bib ref for the reproduction/replication, then we can add it with \fullcite; otherwise we skip the line?

The metadata is missing the reproduction option, yes; good catch. For the hardcoded part, we could use the type of the submission, since the line would only appear for a replication/reproduction?
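
The conditional logic proposed here can be sketched as follows. The actual template is LaTeX, so this Python function only illustrates the decision rule; the function name `replication_note` and the type strings "Replication" and "Reproduction" are assumptions, not values confirmed by the template.

```python
# Illustrative sketch of the proposed rule, not the template's code:
# emit the line under the abstract only for a replication/reproduction
# that has a bibliography key, and derive the wording from the type.

def replication_note(article_type, bib_key):
    """Return the LaTeX line for the note, or "" if it should be skipped."""
    if not bib_key:
        # No bib reference (e.g. editorials, letters): skip the line.
        return ""
    nouns = {"Replication": "replication", "Reproduction": "reproduction"}
    noun = nouns.get(article_type)
    if noun is None:
        # Other submission types never show this line.
        return ""
    return f"A {noun} of \\fullcite{{{bib_key}}}."


print(replication_note("Reproduction", "weinman10typographical"))
```

Deriving the noun from the submission type avoids the hardcoded "A replication of" text that prompted the question about [Rp] versus [Re] submissions.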

If you make a PR, it would be good to reference this thread and maybe we can continue the discussion on the PR. Else, we'll continue to pollute your review 😄

weinman commented 4 years ago

@rougier Thanks again for the review and editorial guidance. @rougier and @gdetor here's the summary of changes:

As I understand it, the editor adds the final touches (DOI, etc.). Thus, the entire document source may be found at Overleaf. If there is some other way I should deliver it, please let me know!

gdetor commented 4 years ago

@weinman @rougier I'll proceed to the final editing and publishing the article

gdetor commented 4 years ago

Hi @weinman Could you please add the following information to the metadata.yaml file, compile the article, and update the Overleaf project so I can get the latest version of those files?

Volume: 6, Issue: 1
DOI: 10.5281/zenodo.4091742
URL: https://zenodo.org/record/4091742/files/article.pdf

Please correct the name of the reviewer to Nicolas and remove the tag "preprint" from the manuscript. Thank you.

weinman commented 4 years ago

Thanks @gdetor! Glad we're nearly there. I can edit those things. Should I leave the dates (so you update them), or shall I insert them? Perhaps as follows?

dates:
  - received: April 30, 2020
  - accepted: September 12, 2020
  - published: October 15, 2020

I also wasn't sure what to give for the article number.

In any case, I've added the volume, issue, DOI, and URL (not sure where that shows up), made the name correction (oops!), and the Preprint has disappeared with the definition of the DOI.

Please let me know what details I surely still need to attend to.

gdetor commented 4 years ago

@weinman Sure, you can update the dates too. I think that's the last piece of information missing from the manuscript. The number will be assigned automatically upon submission. Thank you.

weinman commented 4 years ago

@gdetor Very good, I've made those updates. If you find anything missing, please let me know!

gdetor commented 4 years ago

@weinman Congratulations once again. The article is now online: https://zenodo.org/record/4091742