As currently written, the left join with peptide_predictions duplicates the information from every left-joined dataframe for any peptide that appears more than once in peptide_predictions. Outputting two TSV files would avoid this: one with the peptide predictions (in which peptide_id is nonunique) and one with the joined metadata dataframes (in which peptide_id is unique). A single file might be easier to work with in practice, though. We should assess this after running the pipeline a few times. If there is relatively little duplication of peptide_id values, separating the two files might not matter in practice.
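
To make the trade-off concrete, here is a minimal sketch in Python with pandas. It assumes the pipeline's dataframes behave like pandas dataframes; the toy data, the peptide_metadata name, and the length and prediction columns are hypothetical and only stand in for the real joined metadata. It shows how the left join repeats metadata for a repeated peptide_id, one way to measure how much duplication there is, and what the two-file alternative would look like.

```python
import pandas as pd

# Hypothetical toy data: peptide_id "p1" has two prediction rows.
peptide_predictions = pd.DataFrame({
    "peptide_id": ["p1", "p1", "p2"],
    "prediction": ["antimicrobial", "signal_peptide", "antimicrobial"],
})

# Hypothetical stand-in for the joined metadata dataframes (peptide_id is unique).
peptide_metadata = pd.DataFrame({
    "peptide_id": ["p1", "p2"],
    "length": [12, 30],
})

# Single-file option: the left join repeats the metadata of "p1"
# once per prediction row.
joined = peptide_predictions.merge(peptide_metadata, on="peptide_id", how="left")

# Fraction of prediction rows whose peptide_id occurs more than once.
# A small value suggests the single file carries little redundant metadata.
dup_fraction = peptide_predictions["peptide_id"].duplicated(keep=False).mean()

# Two-file option: serialize each table separately as TSV text.
# Passing a file path as the first argument would write to disk instead.
predictions_tsv = peptide_predictions.to_csv(sep="\t", index=False)
metadata_tsv = peptide_metadata.to_csv(sep="\t", index=False)

print(joined)
print(f"rows with a repeated peptide_id: {dup_fraction:.0%}")
```

Running a check like dup_fraction on real pipeline output a few times would give a number to base the one-file versus two-file decision on.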