PlantProteomes / SeqComparison

A project for comparing plant proteome sequences
Apache License 2.0

Next steps #1

Open edeutsch opened 2 years ago

edeutsch commented 2 years ago

    python fasta_stats.py ../proteomes/maize/orginal/mitochondrion.2.fasta    # shows the summary like you have now
    python fasta_stats.py --show_duplicate_identifiers ../proteomes/maize/orginal/mitochondrion.2.fasta
    python fasta_stats.py --show_duplicate_sequences ../proteomes/maize/orginal/mitochondrion.2.fasta
    python fasta_stats.py --show_duplicate_descriptions ../proteomes/maize/orginal/mitochondrion.2.fasta
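A minimal sketch of how the three duplicate reports could be computed, assuming each FASTA header has the form `>identifier description`. The names `read_fasta` and `find_duplicates` are hypothetical and are not taken from the repository.

```python
import io
from collections import Counter


def read_fasta(handle):
    """Yield (identifier, description, sequence) for each record in a FASTA stream."""
    identifier, description, chunks = None, "", []
    for line in handle:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header closes the previous record, if there was one
            if identifier is not None:
                yield identifier, description, "".join(chunks)
            header = line[1:].split(None, 1)
            identifier = header[0] if header else ""
            description = header[1] if len(header) > 1 else ""
            chunks = []
        elif identifier is not None:
            chunks.append(line)
    if identifier is not None:
        yield identifier, description, "".join(chunks)


def find_duplicates(records, field):
    """Return {value: count} for values of `field` that occur more than once."""
    index = {"identifier": 0, "description": 1, "sequence": 2}[field]
    counts = Counter(record[index] for record in records)
    return {value: n for value, n in counts.items() if n > 1}


# Toy input (made up for illustration, not real maize data)
example = io.StringIO(
    ">sp|A1 first protein\nMKV\nLLA\n"
    ">sp|A2 second protein\nMKVLLA\n"
    ">sp|A1 first protein\nMTT\n"
)
records = list(read_fasta(example))
print(find_duplicates(records, "identifier"))  # {'sp|A1': 2}
print(find_duplicates(records, "sequence"))    # {'MKVLLA': 2}
```

Each `--show_duplicate_*` option could then map to one call of `find_duplicates` with the matching field name.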

    compare_fasta.py input_file_1 input_file_2

Determine:

Make these reusable classes so we can build on this

Build a matrix of overlaps between all the maize files:

            file1   file2   file3
    file1    1000     342      24
    file2             600     599
    file3                     898

primarily by sequences

but also do a matrix by identifier
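One way such an overlap matrix might be built, sketched under the assumption that each file has already been reduced to a set of sequences (or a set of identifiers). The function names and the toy data are hypothetical; the diagonal holds the number of distinct entries in each file, as in the example matrix.

```python
from itertools import combinations_with_replacement


def overlap_matrix(named_sets):
    """Return {(name_a, name_b): shared_count} for the upper triangle, diagonal included."""
    matrix = {}
    for a, b in combinations_with_replacement(list(named_sets), 2):
        matrix[(a, b)] = len(named_sets[a] & named_sets[b])
    return matrix


def format_matrix(named_sets):
    """Render the upper-triangular overlap matrix as tab-separated text."""
    names = list(named_sets)
    matrix = overlap_matrix(named_sets)
    lines = ["\t".join([""] + names)]
    for i, row in enumerate(names):
        # Cells below the diagonal are left empty
        cells = ["" if j < i else str(matrix[(row, col)]) for j, col in enumerate(names)]
        lines.append("\t".join([row] + cells))
    return "\n".join(lines)


# Toy sets of sequences (made up for illustration)
sets = {
    "file1": {"MKV", "MAA", "MTT"},
    "file2": {"MKV", "MAA"},
    "file3": {"MTT", "MGG"},
}
print(format_matrix(sets))
```

Passing sets of identifiers instead of sets of sequences would give the by-identifier matrix with the same code.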

    python compare_fasta.py --by_sequence input_file_1 input_file_2
    python compare_fasta.py --by_identifier input_file_1 input_file_2
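A rough sketch of what the two modes could compute for a pair of files, assuming records are available as `(identifier, sequence)` tuples. The function names and the `shared` statistic are assumptions made for illustration; only the distinct and unique counts are mentioned in this thread.

```python
def keys_from_records(records, by="sequence"):
    """Collect the distinct sequences or identifiers from (identifier, sequence) tuples."""
    if by not in ("sequence", "identifier"):
        raise ValueError(f"Unknown comparison mode: {by}")
    index = 0 if by == "identifier" else 1
    return {record[index] for record in records}


def compare_sets(set_1, set_2):
    """Summarize how two sets of keys relate to each other."""
    return {
        "distinct_in_file_1": len(set_1),
        "distinct_in_file_2": len(set_2),
        "shared": len(set_1 & set_2),
        "unique_to_file_1": len(set_1 - set_2),
        "unique_to_file_2": len(set_2 - set_1),
    }


# Toy records (made up for illustration)
file_1 = [("A1", "MKV"), ("A2", "MAA"), ("A3", "MKV")]
file_2 = [("A1", "MKV"), ("B7", "MTT")]

for mode in ("sequence", "identifier"):
    stats = compare_sets(keys_from_records(file_1, mode), keys_from_records(file_2, mode))
    print(mode, stats)
```

Because both modes reduce each file to a set first, redundancy within a file does not inflate the comparison counts.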

MLi0411 commented 2 years ago

Hi, just to clarify:

"how many sequences unique to each file" is the only part where we are comparing between two files, right? For the rest, "distinct" and "overlapping" refer to distinct and redundant sequences/identifiers within one file? Thanks!

edeutsch commented 2 years ago

You're right, I did mix the concepts a little, but I would say that only the first ("how many distinct sequences in each file") is a property that can be computed from just one file on its own. The rest are all comparing two files. I worded it a bit poorly. Maybe this is better:

Does that make it clearer?

Maybe you can think of other stats that would help us understand how two files are related.

MLi0411 commented 2 years ago

Yup, that makes sense. Thanks!

cashewballerz commented 2 years ago

I'm having an issue recognizing the file locally on my computer. Most of my code should work otherwise. I'll bring it up later today in the meeting. Also, with argparse, do you want this to be applicable to all the fasta files we currently have? Or just to mitochondria.2 to start?

edeutsch commented 2 years ago

okay, great! Yes, let's meet today at 5pm to discuss. You can use argparse to pick up runtime options as well as input files. So I was thinking something like this:

    import argparse
    import os

    def main():
        argparser = argparse.ArgumentParser(description='description of program')
        argparser.add_argument('--n_threads', action='store', type=int, help='Set the number of files to process in parallel (defaults to number of cores)')
        argparser.add_argument('--refresh', action='count', default=0, help='If set, existing metadata for a file will be overwritten rather than skipping the file')
        argparser.add_argument('--verbose', action='count', help='If set, print more information about ongoing processing')
        argparser.add_argument('files', type=str, nargs='+', help='Filenames of one or more mzML files to read')
        params = argparser.parse_args()

        #### Loop over all the files to ensure that they are really there before starting work
        for file in params.files:
            if not os.path.isfile(file):
                print(f"ERROR: File '{file}' not found or not a file")
                return

    if __name__ == "__main__":
        main()

(this is lifted from a different program so the exact options don't make sense for what you're doing, but it shows how to specify options and filenames using argparse)
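Adapted to the options proposed earlier in this thread, the same pattern might look like the sketch below. The three `--show_duplicate_*` flags come from the commands listed above; accepting more than one file (`nargs="+"`) and the help texts are assumptions, and `build_parser` is a hypothetical name.

```python
import argparse


def build_parser():
    """Build a parser with the fasta_stats.py options proposed in this thread."""
    parser = argparse.ArgumentParser(description="Summarize the contents of FASTA files")
    for field in ("identifiers", "sequences", "descriptions"):
        parser.add_argument(
            f"--show_duplicate_{field}",
            action="store_true",
            help=f"If set, list the {field} that occur more than once",
        )
    # Assumption: one or more FASTA files may be given on the command line
    parser.add_argument("files", type=str, nargs="+", help="Filenames of one or more FASTA files to read")
    return parser


# Parse an explicit argument list here so the example runs without a real command line
params = build_parser().parse_args(["--show_duplicate_sequences", "a.fasta", "b.fasta"])
print(params.show_duplicate_sequences)    # True
print(params.show_duplicate_identifiers)  # False
print(params.files)                       # ['a.fasta', 'b.fasta']
```

In the actual script, `parse_args()` would be called without arguments so that it reads `sys.argv`, and the file-existence loop shown above would follow.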