eblerjana / pangenie

Pangenome-based genome inference
MIT License
108 stars 10 forks source link

Error in vcf-merging/pangenome-graph-from-assemblies/Snakemake file #22

Closed LYC-vio closed 1 year ago

LYC-vio commented 1 year ago

Hi,

When I tried to reproduce your pangenome graph, I noticed that there were several errors in vcf-merging/pangenome-graph-from-assemblies/Snakemake file, mainly about generating trio sample list and ped file.

According to function parse_trios in scripts/mendelian-consistency.py,

def parse_trios(ped_file, samples):
    samples_to_include = set()
    with open(samples, 'r') as listing:
        for line in listing:
            samples_to_include.add(line.strip())
    trios = {}
    for line in open(ped_file, 'r'):
        if line.startswith('#'):
            continue
        fields = line.split()
        # map trio_name -> [child, parent1, parent2, nr consistent, nr_inconsistent, nr_untyped, nr total]
        if any([s not in samples_to_include for s in fields[1:4]]):
            continue
        trios[(fields[1], fields[2], fields[3])] = [fields[1], fields[2], fields[3], 0, 0, 0, 0]
    return trios

trio sample list (results/trio-samples.txt) should be like:

HG00731
HG00732
HG00733
HG00512
HG00513
HG00514
NA19238
NA19239
NA19240

However, the Snakemake file generated a file like:

Puerto Rican
HG00731
HG00732
HG00733
Chinese
HG00512
HG00513
HG00514
Yoruban
NA19238
NA19239
NA19240

And ped (results/trios.ped) file should look like:

Puerto_Rican    HG00733 HG00731 HG00732
Chinese HG00514 HG00512 HG00513
Yoruban NA19240 NA19238 NA19239

instead of:

Puerto Rican    Puerto Rican    HG00731 HG00732
Chinese Chinese HG00512 HG00513
Yoruban Yoruban NA19238 NA19239

(Also, trio names should not contain spaces)

To solve the problem above, you may need to modify some lines in Snakemake file:

#################################################################
#  5) Check mendelian consistency for trios and construct graph
#################################################################

# generate a file specifying the trio relationships
rule generate_ped_file:
    output:
        "{outdir}trios.ped"
    run:
        with open(output[0], "w") as ped_output:
            for trio in config['trios']:
                father=config['trios'][trio][0]
                mother=config['trios'][trio][1]
                #Original
                #ped_output.write('\t'.join([trio, trio, father, mother]) + '\n')
                #MODIFED#
                child=config['trios'][trio][2]
                ped_output.write('\t'.join([trio, child, father, mother]) + '\n')

rule generate_samples_file:
    output:
        "{outdir}trio-samples.txt"
    run:
        with open(output[0], "w") as sample_output:
            for trio in config['trios']:
                #sample_output.write(trio + '\n') #MODIFIED#
                for sample in config['trios'][trio]:
                    sample_output.write(sample + '\n')
eblerjana commented 1 year ago

Trios must be specified using the following syntax in the config file:

"trios":
        {   
            "child_sample" : ["father_sample", "mother_sample"]    
        },

For example:

"trios":
        {   
            "HG00733" : ["HG00731", "HG00732"]    
        },

I added this example to the README of the repository.