deven / PDF-Data

PDF::Data - Manipulate PDF files and objects as data structures

Missing files/info in CPAN distribution #1

Open esabol opened 3 months ago

esabol commented 3 months ago

Hi, Deven. I watched your TPRC video, and I'm interested in your Perl module! I need to validate and combine PDFs a lot, and I'd like to do that in Perl, but so far I've been using other tools such as pdftk. So I checked out your module on CPAN....

Your Perl distribution lacked any information on the GitHub repository, but I found it after several searches here on GitHub. Please add a CONTRIBUTING or CONTRIBUTING.md file with this information. Reference:

https://metacpan.org/dist/PDF-Data/contribute

I'd suggest adding a Changelog, too. It's the first thing I look for with any module. Kwalitee also complains that the license isn't included in the source code. I expected it to be in the pod text myself. Reference:

https://cpants.cpanauthors.org/release/DEVEN/PDF-Data-v1.1.0

I think you also forgot to tag your most recent CPAN release here on GitHub:

https://github.com/deven/PDF-Data/releases

This is unrelated to this issue, but please indulge me. Let's say you wanted to append one PDF to another PDF. Can you do that with PDF::Data? If yes, how? I see the append_page() method. Loop over the pages in the second pdf and call append_page on the first PDF? Maybe add a more convenient method to do that? I guess you could call that a feature request. If so, I'll open another issue for that if you are receptive. Thanks!

deven commented 3 months ago

Hi Ed! Thanks for taking an interest in PDF::Data and for the ++ as well!

I also just noticed that the repository link was missing. I thought I had included such things when I made the original release, but apparently not. I'll have to revisit that. Thanks for the heads up!

Can you think of another module that has all the metadata just right? It might be useful as a reference!

As for appending one PDF to another, that could be done, but you're right, it could be made easier. That's a good idea; feel free to create such feature requests. Perhaps the simplest solution would be to extend the current API to allow append_page() to accept multiple page objects instead of just one, and maybe also allow page tree nodes or entire PDF documents as arguments?

The part of appending one PDF to another that might be a little tricky is to handle the case when the PDF has a tree of page nodes instead of just putting all the pages in a single root node of the page tree. It might help if I had PDF::Data automatically flatten the page tree when it loads the data, though it's possible that having a thousand pages in the root node might have some detrimental effects?

At any rate, if you load two different PDFs into memory, you can readily copy data from one to the other. If you don't intend to save the second PDF at all, you can just move pages over by calling $pdf->append_page() with each page object from the second PDF. You could also call $pdf2->copy_page() first to deep-clone the page, although I suppose that could create redundant copies of shared resources. Given that issue, I would recommend just moving the pages and treating $pdf2 as unsuitable for saving again.

This should work if the page tree is flat:

use PDF::Data;

# Load both documents, then move every page of the second into the first.
my $pdf  = PDF::Data->read_pdf("original.pdf");
my $pdf2 = PDF::Data->read_pdf("extra.pdf");
$pdf->append_page($_) foreach @{$pdf2->{Root}{Pages}{Kids}};
$pdf->write_pdf("combined.pdf");

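If the page tree is nested rather than flat, the leaf pages could be collected recursively before appending them. The following is only a sketch: leaf_pages() is a hypothetical helper, not part of PDF::Data, and it assumes that intermediate nodes are hash references with a Kids array and that leaf pages have no Kids entry. Note also that PDF allows attributes such as Resources and MediaBox to be inherited from parent page tree nodes, and this sketch does not copy those down to the pages it returns.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of PDF::Data): return every leaf page
# under a page tree node, in document order.
# Assumption: intermediate nodes carry a Kids array ref, leaf pages do not.
sub leaf_pages {
    my ($node) = @_;
    return $node unless ref($node->{Kids}) eq 'ARRAY';
    return map { leaf_pages($_) } @{ $node->{Kids} };
}

# With two documents loaded as in the example above, usage might look like:
#   $pdf->append_page($_) foreach leaf_pages($pdf2->{Root}{Pages});
```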
If you are trying to read a PDF file which is PDF 1.5 or later using object streams and cross-reference streams, PDF::Data doesn't know how to process those yet, but you can try loading it like this if you have qpdf installed:

$pdf = PDF::Data->parse_pdf(scalar `qpdf --force-version=1.4 original.pdf -`);
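The backtick form above goes through the shell and ignores whether qpdf succeeded. A slightly more defensive variant might run the command without a shell and check its exit status. This is a sketch only: capture_command() is a hypothetical helper, and accepting exit status 3 relies on the qpdf manual describing that status as "warnings, but output was produced".

```perl
use strict;
use warnings;

# Hypothetical helper: run a command without a shell and return its standard
# output as one string. Dies if the command cannot be started, is killed by
# a signal, or exits with a status that is not listed in @$ok_status.
sub capture_command {
    my ($ok_status, @command) = @_;
    open(my $pipe, '-|', @command) or die "Cannot run $command[0]: $!\n";
    binmode $pipe;                              # PDF data is binary
    my $output = do { local $/; <$pipe> };      # slurp everything
    close $pipe;                                # sets $? to the wait status
    my $wait = $?;
    die "$command[0] was killed by signal " . ($wait & 127) . "\n" if $wait & 127;
    my $status = $wait >> 8;
    die "$command[0] failed (exit status $status)\n"
        unless grep { $_ == $status } @$ok_status;
    return defined $output ? $output : '';
}

# Possible usage with PDF::Data, assuming qpdf is installed:
#   my $data = capture_command([0, 3], 'qpdf', '--force-version=1.4', 'original.pdf', '-');
#   my $pdf  = PDF::Data->parse_pdf($data);
```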

esabol commented 3 months ago

> Can you think of another module that has all the metadata just right? It might be useful as a reference!

DateTime is one, but Dave uses Dist::Zilla, so I'm not sure that's a good reference: https://metacpan.org/dist/DateTime

Hmm, maybe DBD::Pg would be a good reference for you: https://github.com/bucardo/dbdpg

Also, check out CPAN Digger. It has links to useful documentation on how to add these things. Most of this stuff goes in a META.yml or META.json file, I think.

https://cpan-digger.perlmaven.com/dist/PDF-Data

https://perlmaven.com/how-to-add-link-to-version-control-system-of-a-cpan-distributions

https://perlmaven.com/how-to-add-the-license-field-to-meta-files-on-cpan

And this might be useful:

https://metacpan.org/pod/CPAN::Meta::Spec

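For illustration, here is roughly how that metadata might be declared if the distribution is built with ExtUtils::MakeMaker. That is an assumption, since the thread doesn't say which build tool PDF::Data uses. The module path and the license identifier below are placeholders to be replaced with the real values; the repository and issue tracker URLs are the ones for this GitHub project.

```perl
use strict;
use warnings;
use ExtUtils::MakeMaker;

WriteMakefile(
    NAME         => 'PDF::Data',
    VERSION_FROM => 'lib/PDF/Data.pm',   # assumed path to the main module
    LICENSE      => 'perl_5',            # placeholder: use the identifier for the actual license
    META_MERGE   => {
        'meta-spec' => { version => 2 }, # resources below follow CPAN::Meta::Spec version 2
        resources   => {
            repository => {
                type => 'git',
                url  => 'https://github.com/deven/PDF-Data.git',
                web  => 'https://github.com/deven/PDF-Data',
            },
            bugtracker => {
                web => 'https://github.com/deven/PDF-Data/issues',
            },
        },
    },
);
```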
I also recommend adding a GitHub Actions CI workflow:

https://perlmaven.com/setup-github-actions

Reference CI you can easily adapt: https://github.com/sciurius/perl-Text-Layout/blob/master/.github/workflows/ci.yml

Thanks for the quick how-to on appending one pdf to another! I wasn't familiar with qpdf before your talk, so thanks also for that. I will be checking it out.

esabol commented 1 month ago

Great to see another release! Please add a Changes file to the distribution (and repo).

Oh, there's a CHANGELOG.md, of course. I don't think the name matters. The problem is that CPAN doesn't link to it, and it comes down to missing metadata in the distribution.

deven commented 1 month ago

I rebased the entire history of the project and retroactively added the CHANGELOG.md file, and also added signed annotated tags for all release versions and added all releases to GitHub's list of releases.

Note that the distribution tarballs currently on CPAN do not contain this CHANGELOG.md file, but it will be included in the next release and I'll try to make sure the CPAN metadata is updated as well.

deven commented 1 month ago

Also, note that all of the releases now include the changelog entries for each release:

https://github.com/deven/PDF-Data/releases

esabol commented 1 month ago

> Also, note that all of the releases now include the changelog entries for each release:
>
> https://github.com/deven/PDF-Data/releases

Yeah, I saw that! Good stuff! 👍