Atkinson-Lab / Tractor

Scripts for implementing the Tractor pipeline
MIT License
44 stars 5 forks source link

`extract_tracks.py`: NameError: name 'output_files' is not defined #26

Closed biona001 closed 9 months ago

biona001 commented 9 months ago

Hi,

I was trying to run the extract_tracks.py script, but it threw an error

$ python3 /u/home/b/biona001/Tractor/scripts/extract_tracts.py \
    --vcf /u/home/b/biona001/project-loes/ForBen_genotypes_subset/LAI/vcf_phased/chr22.vcf.gz \
    --msp /u/home/b/biona001/project-loes/ForBen_genotypes_subset/LAI/output/chr22.msp.tsv \
    --num-ancs 3 \
    --output-dir /u/home/b/biona001/project-loes/ForBen_genotypes_subset/LAI/tracks

INFO (__main__ 90): # VCF File                    : /u/home/b/biona001/project-loes/ForBen_genotypes_subset/LAI/vcf_phased/chr22.vcf.gz
INFO (__main__ 91): # Prefix of output file names : chr22
INFO (__main__ 92): # VCF File is compressed?     : True
INFO (__main__ 93): # Number of Ancestries in VCF : 3
INFO (__main__ 94): # Output Directory            : /u/home/b/biona001/project-loes/ForBen_genotypes_subset/LAI/tracks
INFO (__main__ 100): Creating output files for 3 ancestries
Traceback (most recent call last):
  File "/u/home/b/biona001/Tractor/scripts/extract_tracts.py", line 239, in <module>
    extract_tracts(**vars(args))
  File "/u/home/b/biona001/Tractor/scripts/extract_tracts.py", line 102, in extract_tracts
    output_files[f"dos{i}"] = f"{output_path}anc{i}.dosage.txt{file_extension}"
NameError: name 'output_files' is not defined

Any tips/suggestions would be highly appreciated.

biona001 commented 9 months ago

In extract_tracts function I tried adding the line

output_files = {}

but it led to another error

NameError: name 'geno_parts' is not defined

Then I modified the variable geno_parts to just geno, and the script seemed to work.

EfraMP commented 9 months ago

I got the exact same error, and after doing those modifications I'm getting the next error:

$ python3 Tractor/scripts/extract_tracts.py --vcf subset1/query_file_phased.vcf --msp subset1/query_results.msp --output-dir output/ --num-ancs 8
INFO (__main__ 91): # VCF File                    : subset1/query_file_phased.vcf
INFO (__main__ 92): # Prefix of output file names : query_file_phased
INFO (__main__ 93): # VCF File is compressed?     : False
INFO (__main__ 94): # Number of Ancestries in VCF : 8
INFO (__main__ 95): # Output Directory            : output/
INFO (__main__ 101): Creating output files for 8 ancestries
INFO (__main__ 116): Iterating through VCF file
Traceback (most recent call last):
  File "/path/Tractor/scripts/extract_tracts.py", line 240, in <module>
    extract_tracts(**vars(args))
  File "/patj/Tractor/scripts/extract_tracts.py", line 170, in extract_tracts
    window = (ancs_entry[0], int(ancs_entry[1]), int(ancs_entry[2]))
ValueError: invalid literal for int() with base 10: '0.0'

Any idea?

samreenzafer commented 9 months ago

I was getting the same errors so I switched to using the older version of tractor scripts.i.e. V0.0.1 from GitHub which came with the following files

-rwxrwxrwx 1 2.3K UnkinkMSPfile.py -rwxrwxrwx 1 2.5K UnkinkGenofile.py -rwxrwxrwx 1 19K TractorIcon.png -rwxrwxrwx 1 5.2K Tractor-Example-GWAS.py -rwxrwxrwx 1 29K Tractor-Example-GWAS-FromHailformat.ipynb -rwxrwxrwx 1 8.1K RunTractor.py -rwxrwxrwx 1 1.6K README.md -rwxrwxrwx 1 884 hg19_telomeres.bed -rwxrwxrwx 1 8.7K ExtractTracts.py -rwxrwxrwx 1 5.2K ExtractTracts-legacy-2way.py drwxrwxrwx 8 4.0K .git drwxrwxrwx 2 4.0K Example

samreenzafer commented 9 months ago

In extract_tracts function I tried adding the line

output_files = {}

but it led to another error

NameError: name 'geno_parts' is not defined

Then I modified the variable geno_parts to just geno, and the script seemed to work.

I tried your modifications to the script in the latest version (extract_tracts.py) and as you mentioned the older errors vanished. However, I now get another error, at the same geno[:2] statement. I'm not familiar with python, so don't know how to debug the code.

python Tractor/scripts/extract_tracts.py --vcf mergedCNICs.chr22.vcf.gz --msp mergedCNICs.chr22.deconvoluted.msp.tsv --num-ancs 3 --output-dir pwd/ --output-vcf --compress-output INFO (main 90): # VCF File : mergedCNICs.chr22.vcf.gz INFO (main 91): # Prefix of output file names : mergedCNICs.chr22 INFO (main 92): # VCF File is compressed? : True INFO (main 93): # Number of Ancestries in VCF : 3 INFO (main 94): # Output Directory : /hpc/users/zafers02/HIV_SamreenPaola/analysis/tractor/localancestry_vcf/ INFO (main 100): Creating output files for 3 ancestries INFO (main 116): Iterating through VCF file INFO (main 175): VCF position, 16055070 is not in an msp window, skipping site INFO (main 175): VCF position, 16055122 is not in an msp window, skipping site INFO (main 175): VCF position, 16055942 is not in an msp window, skipping site INFO (main 175): VCF position, 16056936 is not in an msp window, skipping site INFO (main 175): VCF position, 16057417 is not in an msp window, skipping site INFO (main 175): VCF position, 16231396 is not in an msp window, skipping site INFO (main 175): VCF position, 16285185 is not in an msp window, skipping site INFO (main 175): VCF position, 16287784 is not in an msp window, skipping site INFO (main 175): VCF position, 16341421 is not in an msp window, skipping site INFO (main 175): VCF position, 16347845 is not in an msp window, skipping site INFO (main 175): VCF position, 16364176 is not in an msp window, skipping site INFO (main 175): VCF position, 16364981 is not in an msp window, skipping site INFO (main 175): VCF position, 16393312 is not in an msp window, skipping site INFO (main 175): VCF position, 16419883 is not in an msp window, skipping site INFO (main 175): VCF position, 16428267 is not in an msp window, skipping site INFO (main 175): VCF position, 16433369 is not in an msp window, skipping site INFO (main 175): VCF position, 16494800 is not in an msp window, skipping site INFO (main 175): VCF position, 16495458 is not in an msp window, skipping site INFO (main 175): VCF position, 16497038 is not in an msp window, skipping site INFO (main 175): VCF position, 16498839 is not in an msp window, skipping site INFO (main 175): VCF position, 16504399 is not in an msp window, skipping site INFO (main 175): VCF position, 16517801 is not in an msp window, skipping site INFO (main 175): VCF position, 16548272 is not in an msp window, skipping site INFO (main 175): VCF position, 16568293 is not in an msp window, skipping site INFO (main 175): VCF position, 16572329 is not in an msp window, skipping site INFO (main 175): VCF position, 16610224 is not in an msp window, skipping site INFO (main 175): VCF position, 16638894 is not in an msp window, skipping site INFO (main 175): VCF position, 16649679 is not in an msp window, skipping site INFO (main 175): VCF position, 16650301 is not in an msp window, skipping site INFO (main 175): VCF position, 16850308 is not in an msp window, skipping site INFO (main 175): VCF position, 16855160 is not in an msp window, skipping site Traceback (most recent call last): File "Tractor/scripts/extract_tracts.py", line 240, in extract_tracts(**vars(args)) File "Tractor/scripts/extract_tracts.py", line 181, in extract_tracts geno_a,geno_b = map(str, geno[:2]) ValueError: not enough values to unpack (expected 2, got 1)

samreenzafer commented 9 months ago

I figured this error out.

File "Tractor/scripts/extract_tracts.py", line 181, in extract_tracts geno_a,geno_b = map(str, geno[:2]) ValueError: not enough values to unpack (expected 2, got 1)

There were variants in the phased which had no-calls, and this was causing the trouble. I originally had 3 cohorts, on which I did phasing separately, and then merged (bcftools merge) the phased vcf files to proceed with LAI and Tractor. While merging variants some samples which did not have variants in other 2 cohorts were given ./. genotypes. Extract_tractor could not handle such genotypes.

I then did a bcftools +missing2ref to the merged_phased vcf files, and now the subsequent tractor steps work properly. bcftools merge --threads $threads -o $output.vcf.gz -O z cohort1.chr$i.phase.vcf.gz cohort2.chr$i.phase.vcf.gz cohort2.chr$i.phase.vcf.gz | bcftools +missing2ref -O z -o $output.missing2ref.vcf.gz -p -m

nirav572 commented 9 months ago

Hi @biona001, Thank you for the bug report and the recommended fix. We have implemented the necessary changes in the code, and should work once you pull the code to the most recent version.

nirav572 commented 9 months ago

@EfraMP A quick search tells me this might have to do with data types. We are happy to look into it, but please open another issue for this. If you can details or provide an example of files you are inputting so we can replicate on our end, that would be helpful.

@samreenzafer You are correct, extract_tracts.py does expect that there are no missing values in the VCF files.