acl-org / acl-anthology

Data and software for building the ACL Anthology.
https://aclanthology.org
Apache License 2.0

Site format inconsistency across articles #4057

Open shyousefi opened 2 days ago

shyousefi commented 2 days ago

Confirm that this is a metadata correction

Anthology ID

2024.signlang-1.3

Type of Paper Metadata Correction

Correction to Paper Title

No response

Correction to Paper Abstract

The site format varies across articles on the volume page for this paper (and on other volume pages). For some articles, the abstract is not available on the volume page: when scraping the page, the abstract field comes back empty. To obtain the abstract, you have to navigate to the individual article page, and in such cases extracting it from the PDF becomes necessary.

Correction to Author Name(s)

No response

mbollmann commented 1 day ago

Whether abstracts appear on the website or not depends on the metadata the workshop organizers supplied us; we don’t scrape PDFs, for example, to get the abstracts. The inconsistency between the volume page and the individual paper pages is something that ideally shouldn’t happen, though.

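However, I would really not recommend scraping the web pages at all; you can extract all information directly from our XML files (https://github.com/acl-org/acl-anthology/tree/master/data/xml) or access it through our Python library (https://acl-anthology-py.readthedocs.io/en/stable/).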
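
For readers who want to try the XML route, here is a minimal sketch of reading abstracts from an Anthology-style XML file with Python's standard library. The embedded sample is hypothetical and simplified: its `collection`/`volume`/`paper`/`abstract` layout and the `2024.example` ID are assumptions made for illustration, so check them against the real files in `data/xml` before relying on this. The helper name `collect_abstracts` is also made up for this example and is not part of the Anthology's Python library.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified sample. The element layout is an assumption
# for illustration; real files in data/xml carry more fields.
SAMPLE_XML = """
<collection id="2024.example">
  <volume id="1">
    <paper id="1">
      <title>A Paper With an <fixed-case>A</fixed-case>bstract</title>
      <abstract>This paper has an abstract.</abstract>
    </paper>
    <paper id="2">
      <title>A Paper Without an Abstract</title>
    </paper>
  </volume>
</collection>
"""


def collect_abstracts(xml_text):
    """Map full paper IDs to abstract text, or None when no abstract was supplied."""
    root = ET.fromstring(xml_text)
    collection_id = root.get("id")
    abstracts = {}
    for volume in root.iter("volume"):
        for paper in volume.iter("paper"):
            full_id = f"{collection_id}-{volume.get('id')}.{paper.get('id')}"
            node = paper.find("abstract")
            # itertext() also collects text nested inside inline markup.
            abstracts[full_id] = (
                None if node is None else "".join(node.itertext()).strip()
            )
    return abstracts


abstracts = collect_abstracts(SAMPLE_XML)
for paper_id, text in abstracts.items():
    print(paper_id, "->", text if text is not None else "(no abstract in metadata)")
```

A `None` value here corresponds to the case described above, where the organizers supplied no abstract in the metadata, so there is nothing for a scraper to find on the website either.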

shyousefi commented 1 day ago

Thank you very much for your reply.

