Open ThiloteE opened 3 years ago
When I push
@Test{test,
author = {test},
date = {2021},
file = {:Thompson (2020-03-30) middle-class-remorse-re-embracing-liberal-democracy-in-the-philippines-and-thailand.pdf:PDF},
title = {test},
}
to the PDF file, there is still some leftover XMP metadata, as can be seen here:
The current JabRef implementation is good if one wants to keep this old metadata, but removing it is currently not possible via JabRef.
There is a related, fruitful discussion at https://gist.github.com/hubgit/6078384, but I am not sure to what extent tools like these can be integrated into JabRef.
Current "feature" of JabRef: Remove specified fields. Maybe this helps in your case somehow, when listing "language", "number", ... in the fields to clear? (Feature highly requested by @adaerr a few years ago.)
Yes! Indeed, Koppor, this is a step in the right direction. Thanks! I did some tests:
These conditions need to be fulfilled for metadata to be deleted:
"Do not write the following fields to XMP Metadata" needs to be ticked.
Example:
@Test{test,
author = {test2},
date = {2021},
file = {:test/test2 (2021).pdf:PDF},
language = {korean},
number = {2},
title = {test},
}
If I put "Number" into the list while "Do not write the following fields to XMP Metadata" is ticked, it will delete the metadata instead of writing the number 2.
If I do the same, except removing the number field from the entry (number = {2},), it will NOT delete the metadata.
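The behaviour observed in this test can be written down as a small model. This is only a sketch of what the test showed, not JabRef's actual code; the function name plan_xmp_write and its parameters are made up for illustration.

```python
def plan_xmp_write(entry_fields, excluded_fields, filter_enabled):
    """Model of the behaviour observed in the test above (not JabRef's code).

    Returns (to_write, to_clear):
    - to_write: fields whose values would be written to the XMP metadata
    - to_clear: fields whose existing XMP value would be deleted
    """
    # The exclusion list only has an effect while the option is ticked.
    excluded = {name.lower() for name in excluded_fields} if filter_enabled else set()
    to_write = {}
    to_clear = []
    for name, value in entry_fields.items():
        if name.lower() in excluded:
            # Observed: an excluded field is deleted only if the entry has it.
            to_clear.append(name)
        else:
            to_write[name] = value
    return to_write, to_clear


entry = {"author": "test2", "date": "2021", "number": "2", "title": "test"}
with_number = plan_xmp_write(entry, ["Number"], filter_enabled=True)

entry_without_number = {k: v for k, v in entry.items() if k != "number"}
without_number = plan_xmp_write(entry_without_number, ["Number"], filter_enabled=True)
```

In this model, with_number marks "number" for deletion, while without_number marks nothing, which matches the two cases described above.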
So we have found out how to delete metadata with JabRef, and the way it currently works allows very fine-grained usage. This is good. It is not perfect, but it is good. A possible low-hanging fruit for improvement would be to ease the workflow by cutting down on the conditions that need to be fulfilled to delete something; condition 3 in particular seems tedious.
The next question I asked myself: how would it be possible to delete the metadata not only for a single entry, but for ALL PDF files linked to the entries in my library?
My aim is to replace all the 'bad' and 'false' metadata currently attached to my PDFs with the (maybe not perfect, but at least ... ) more correct metadata I got from importing via DOI and from manual corrections.
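Any library-wide approach first has to find the PDFs linked from each entry. A rough sketch of that step follows. It assumes the "description:path:type" layout of the file field seen in the entries above, with ";" between multiple links and a backslash escaping ":" or ";". The function name linked_pdfs is hypothetical, and resolving relative paths against the main file directory is left out.

```python
import re


def linked_pdfs(file_field):
    """Return the paths of all PDF links in a JabRef-style file field.

    Assumes each link has three parts, description:path:type, that links
    are separated by ';', and that a backslash escapes ':' and ';'.
    """
    paths = []
    # Split on separators that are not preceded by a backslash.
    for link in re.split(r"(?<!\\);", file_field):
        parts = re.split(r"(?<!\\):", link)
        if len(parts) != 3:
            continue  # not in the assumed three-part layout
        _description, path, file_type = parts
        if file_type.strip().upper() == "PDF":
            # Undo the escaping inside the path itself.
            paths.append(re.sub(r"\\([:;])", r"\1", path))
    return paths


single = linked_pdfs(":test/test2 (2021).pdf:PDF")
mixed = linked_pdfs(":a.pdf:PDF;:notes.txt:Text;:b.pdf:PDF")
```

Here single contains only the path test/test2 (2021).pdf, and mixed contains a.pdf and b.pdf but not the text file.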
Prototype (untested) workaround
main file directory.
Problems with workaround:
Edit:
Proper solutions would be:
Specify in the preferences which metadata should be deleted. Give option to delete all the metadata JabRef is able to write. (e.g. at least all bibtex fields). Then:
Entry based solution(s): A)
Strengths of this solution:
and / or
Folder based solution(s): B)
options > preferences > linked files
Strength of this solution:
How to do this, I don't know.
I favour an entry-based solution. The current tedious method, as explained in https://github.com/JabRef/jabref/issues/8277#issuecomment-987540344, is also an entry-based solution. If push comes to shove, ExifTool and other folder-based tools exist, so I think it is alright if JabRef goes in the entry-based direction.
Hi, I am new to open source and would like to contribute to the project. Is it okay for me to try working on this issue?
Hey, I edited my last comment!
Yes you may :) Thanks for your interest!
Check out https://github.com/JabRef/jabref/blob/main/CONTRIBUTING.md for a start. Also, https://devdocs.jabref.org/getting-into-the-code/guidelines-for-setting-up-a-local-workspace is a good start. Feel free to ask if you have any questions here on GitHub or also at gitter.
Try to open a (draft) pull request early on, so that people can see you are working on the issue and so that they can see the direction the pull request is heading towards. This way, you will likely receive valuable feedback.
Noted. I will open a draft PR once I make some progress.
For testing your changes, I can recommend ExifTool. https://www.exiftool.org/
ExifTool is a platform-independent Perl library plus a command-line application for reading, writing and editing meta information in a wide variety of files.
The FAQ explains how to extract (read) really all available metadata that is attached to a PDF:
"How do I extract absolutely all metadata from a file?"
By default, duplicate tags, unknown tags, embedded tags, and System tags that require external utilities are not extracted. The main reason for this is performance; extracting these tags will significantly increase processing time for some files. The following command extracts everything possible with ExifTool:
exiftool -ee3 -U -G3:1 -api requestall=3 -api largefilesupport FILE
(The -G3:1 option is included in the above command only to give an indication of where the metadata was stored.)
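For repeated testing it may be handy to call this command from a script, for example to dump the metadata of a PDF before and after pushing from JabRef and compare the two outputs. A minimal sketch, assuming exiftool is on the PATH; the function names are made up here and the flags are exactly the ones from the FAQ command above.

```python
import shutil
import subprocess


def exiftool_dump_command(pdf_path):
    """Build the 'extract everything' command quoted from the ExifTool FAQ."""
    return [
        "exiftool", "-ee3", "-U", "-G3:1",
        "-api", "requestall=3",
        "-api", "largefilesupport",
        pdf_path,
    ]


def dump_all_metadata(pdf_path):
    """Run ExifTool on pdf_path and return its text output.

    Returns None if the exiftool executable cannot be found.
    """
    if shutil.which("exiftool") is None:
        return None
    result = subprocess.run(
        exiftool_dump_command(pdf_path),
        capture_output=True, text=True, check=False,
    )
    return result.stdout


command = exiftool_dump_command("test2 (2021).pdf")
```

Passing the arguments as a list avoids shell quoting problems with file names that contain spaces or parentheses, like the ones in the entries above.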
Some code was written at https://github.com/JabRef/jabref/pull/8681; however, the contributors did not continue working on it. Potential contributors can use the code and the discussions as a basis.
This is a CleanUpJob similar to org.jabref.logic.cleanup.MoveFilesCleanup.
Problem: My goal was to write XMP metadata to a PDF, but there was metadata attached to the PDF already (because I had attached another entry to it once, and that entry had wrong data). So now the PDF carries the data that I do NOT WANT attached in addition to the data that I WANT attached.
Describe the solution you'd like: Add a feature that allows removing all XMP metadata from one or multiple PDFs.