UAlbanyArchives / mailbagit

A tool for creating and managing Mailbags, a package for preserving email using multiple preservation formats
https://archives.albany.edu/mailbag/
MIT License

Filtering mechanism for sender/recipient emails #28

Open gwiedeman opened 3 years ago

gwiedeman commented 3 years ago

The problem the component solves

Requirement #35: "Provide a method of keeping or excluding specific email folders while creating Mailbags."

A user may want to exclude a list of email folders, such as "Drafts" or "Trash", or potentially a list of native Message-IDs to exclude individual emails from a mailbag. I imagine this would be a command-line argument giving the path to a text file, or something similar. If this option is used, Mailbag must exclude these folders or messages when creating derivatives and documenting metadata in mailbag.csv, bag-info.txt and other tag files. The hard part is that these would also have to be excluded from the source file, like the MBOX or PST file that gets saved in a mailbag as well.

The specification also has a space for documenting these exclusions in folders_not_retained.txt or messages_not_retained.txt.
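A minimal sketch of how such an exclusion list could be read and applied, not mailbagit's actual implementation. The function names, the `folder` and `message_id` dictionary keys, and the one-entry-per-line layout for both the input list and the two tag files are assumptions made for illustration. Folder matching here is exact, so subfolders would need to be listed separately.

```python
from pathlib import Path


def load_exclusions(list_path):
    """Read one folder path or Message-ID per line, ignoring blank lines."""
    text = Path(list_path).read_text(encoding="utf-8")
    return {line.strip() for line in text.splitlines() if line.strip()}


def partition_messages(messages, excluded_folders=frozenset(), excluded_ids=frozenset()):
    """Split message dicts into (retained, not_retained) lists.

    Each message is assumed to be a dict with "folder" and "message_id" keys.
    """
    retained, not_retained = [], []
    for message in messages:
        excluded = (
            message["folder"] in excluded_folders
            or message["message_id"] in excluded_ids
        )
        (not_retained if excluded else retained).append(message)
    return retained, not_retained


def write_not_retained(bag_dir, excluded_folders, excluded_ids):
    """Record exclusions in the tag files, one entry per line (assumed layout)."""
    bag_dir = Path(bag_dir)
    if excluded_folders:
        body = "\n".join(sorted(excluded_folders)) + "\n"
        (bag_dir / "folders_not_retained.txt").write_text(body, encoding="utf-8")
    if excluded_ids:
        body = "\n".join(sorted(excluded_ids)) + "\n"
        (bag_dir / "messages_not_retained.txt").write_text(body, encoding="utf-8")
```

Only the retained list would then be passed on to derivative creation and to the rows written to mailbag.csv.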

Relevant part of mailbag spec?

5.4 folders_not_retained.txt and messages_not_retained.txt

Type of component

Expected contribution

Major challenges or things to keep in mind

We'll have to exclude these folders or messages from the source files as well if we're keeping them in the output mailbag. The idea is that email often has to be excluded for legal or privacy reasons, so we don't want this data in the mailbag at all. Editing an MBOX or removing EML/MSG files should be feasible, but I expect editing a PST to be a problem.
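For the MBOX case, Python's standard-library `mailbox` module can delete messages in place, which suggests the edit is feasible. The sketch below is an illustration only: `remove_from_mbox` is a hypothetical helper, not part of mailbagit, and it assumes messages are matched on their exact `Message-ID` header value.

```python
import mailbox


def remove_from_mbox(mbox_path, excluded_ids):
    """Delete messages whose Message-ID is in excluded_ids; return the removed IDs."""
    box = mailbox.mbox(mbox_path, create=False)
    removed = []
    box.lock()
    try:
        # keys() returns a list, so removing entries while looping is safe.
        for key in box.keys():
            message_id = str(box[key].get("Message-ID") or "").strip()
            if message_id in excluded_ids:
                box.remove(key)
                removed.append(message_id)
        box.flush()  # rewrite the file on disk without the removed messages
    finally:
        box.unlock()
        box.close()
    return removed
```

This rewrites the MBOX file itself, so it would need to run on a copy if the original must stay untouched. Nothing comparable exists in the standard library for PST files.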

gwiedeman commented 2 years ago

@haritgarg Next you should look into editing MBOX and PST files, or potentially creating them from scratch, to see how feasible this would be.

gwiedeman commented 2 years ago

Exclusions should be done in the controller and affect writing to CSV and derivatives creation.

gwiedeman commented 2 years ago

This is on hold since excluding messages for PST outputs (either derivatives or from the original file) isn't feasible. We have to reexamine what the purpose and use cases of the exclusion functionality are.

gwiedeman commented 2 years ago

Since editing certain types of input files (PST, MSG) isn't feasible (#25), I think the path for this is to exclude the entire source file during packaging. Say I have a PST file and I only want to package a certain folder. Any folder or message exclusions would also exclude the PST file, and you would have to rely only on derivatives such as MBOX, EML, etc. These derivatives would be lossy, so while I think filtering is a common use case, it would have to be worth the risk of that loss. It raises other issues too: since we basically run a shutil.move on the input files, if we're excluding them we should leave them as-is. I think this path is reasonable, but since the usefulness of this is in doubt, we'll de-prioritize this and comment out the argparse options for now. We'll probably wait until we get community feedback asking for this feature before it gets implemented.
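The move-or-leave decision described above could look roughly like this. It is a sketch under stated assumptions: `package_source`, `bag_data_dir`, and the `has_exclusions` flag are hypothetical names, and the real packaging code may organize this differently.

```python
import shutil
from pathlib import Path


def package_source(source_path, bag_data_dir, has_exclusions):
    """Move the source file into the bag, unless exclusions are active.

    Returns the new path inside the bag, or None when the source was
    left in place because folders or messages are being excluded.
    """
    source = Path(source_path)
    if has_exclusions:
        # The source still contains the excluded email, so keep it out
        # of the mailbag and leave the original file as-is.
        return None
    destination = Path(bag_data_dir) / source.name
    destination.parent.mkdir(parents=True, exist_ok=True)
    shutil.move(str(source), str(destination))
    return destination
```

With exclusions active, the mailbag would then contain only the filtered derivatives.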