apriha / snps

tools for reading, writing, merging, and remapping SNPs
BSD 3-Clause "New" or "Revised" License

AttributeError: 'str' object has no attribute '_output_dir' #120

Open lakishadavid opened 3 years ago

lakishadavid commented 3 years ago

I'm using aws ec2 ubuntu. It does not allow me to create an individual.

user662 = l.create_individual('User662', '/home/ubuntu/myprojectdir/AaronAzuma.zip')
Traceback (most recent call last):
  File "", line 1, in
  File "/home/ubuntu/myprojectdir/venv/lib/python3.8/site-packages/lineage/__init__.py", line 96, in create_individual
    return Individual(name, raw_data, self._output_dir, **kwargs)
AttributeError: 'str' object has no attribute '_output_dir'

apriha commented 3 years ago

Thanks for the issue. Can you provide more details or code snippets? I just tested installing and running the README examples in a Python 3.8 virtual environment without any issues.

lakishadavid commented 3 years ago

Thanks Andrew,

After running your example data and seeing create_individual work, I realized that my issue was with parsing. I had already converted the file from AncestryDNA to 23andMe format before calling create_individual. I get a parsing error, which keeps me from going any further. My other set of files (from the H3Africa array, run at another lab) also has 4 columns like 23andMe but no headers.

$ sed -n 1,20p lineage/inputs/myfile.txt

AncestryDNA raw data download

This file was generated by AncestryDNA at: 07/31/2018 23:48:22 UTC

Data was collected using AncestryDNA array version: V2.0

Data is formatted using AncestryDNA converter version: V1.0

...
rsid chromosome position allele1allele2
rs369202065 1 569388 GG

$ python manage.py shell
Python 3.8.5 (default, Jul 28 2020, 12:59:40) [GCC 9.3.0] on linux

from lineage import Lineage
l = Lineage()

user111 = l.create_individual('User111', 'myfile.txt')
pandas.errors.ParserError: Too many columns specified: expected 5 and found 4

LaKisha


apriha commented 3 years ago

Thanks LaKisha, that helps. lineage uses the snps library to parse files, so I transferred the issue here.

snps should be able to read raw AncestryDNA or 23andMe files without conversion... However, snps could be updated to handle the format you pasted as well. Do you have a link to the tool that produces that format?

As for the H3Africa files, can you confirm that an example file would look like this (tab-separated):

rs1 1   101 AA
rs2 1   102 CC
rs3 1   103 GG
rs4 1   104 TT
rs5 1   105 --
rs6 1   106 GC
rs7 1   107 TC
rs8 1   108 AT
...

lakishadavid commented 3 years ago

Hi Andrew,

Here is the script I'm using to convert my files from AncestryDNA to 23andMe format:

(venv) ubuntu@:~/myprojectdir/lineage/inputs$ for file in ./*.txt; do echo "converting from AncestryDNA to 23andMe format file:" $file; gawk -i inplace -F'\t' '{ print $1"\t"$2"\t"$3"\t"$4$5; }' $file; done

This line results in a text file that looks like this:

rsid chromosome position allele1allele2
rs369202065 1 569388 GG
rs199476136 1 569400 TT
rs3131972 1 752721 AG
rs114525117 1 759036 GG
rs12124819 1 776546 AA
rs4040617 1 779322 AA
rs141175086 1 780397 CC
rs115093905 1 787173 GG
rs11240777 1 798959 AG

The H3Africa file looks like this after using the command line (tab):

h3a_37_1_54676_C_T 1 54676 AA
seq-h3a_37_1_61989_G_C 1 61989 CC
seq-h3a_37_1_62271_A_G 1 62271 AA
seq-h3a_37_1_64552_G_A 1 64552 AA
seq-h3a_37_1_104072_C_T 1 104072 GG
h3a_37_1_108310_T_C 1 108310 AA
h3a_37_1_110509_G_A 1 110509 GG
seq-h3a_37_1_118617_T_C 1 118617 GG
seq-h3a_37_1_256586_T_G 1 256586 AC
h3a_37_1_404672_G_A 1 404672 AA
kgp15717912 1 534247 GG

In case it helps: after converting to 23andMe format, I convert the files to VCF format for use downstream. Your tool is really quick, plus it produces the graph. It would be great if I could use it in my pipeline. Here's my 23andMe to VCF conversion:

(venv) ubuntu@:~/myprojectdir/lineage/inputs$ for file in ./*txt; do echo "converting to vcf file:" $file; bcftools convert -c ID,CHROM,POS,AA -s ${file%.txt} --haploid2diploid -f ../references/Homo_sapiens.GRCh37.75.dna.primary_assembly.fa --tsv2vcf $file -Oz -o ${file%.txt}.vcf.gz; done

Index multiple vcf files in prep to merge

for file in ./*.vcf.gz; do echo "indexing vcf file" $file; tabix $file; done

Merge multiple vcf file into single vcf file

bcftools merge -Oz -o MergedSamples1.vcf.gz ../inputs/*.vcf.gz

Clean MergedSamples file

bgzip -d ../results/MergedSamples.vcf.gz
grep ^"#" ../results/MergedSamples.vcf > ../results/MergedSamples0.vcf
awk -F$'\t' '{ if ( $3 ~ "rs" ) { print $0; } }' ../results/MergedSamples.vcf > ../results/MergedSamples1.vcf
awk -F$'\t' '{ if ( $3 !~ ";" ) { print $0; } }' ../results/MergedSamples1.vcf > ../results/MergedSamples2.vcf
cat ../results/MergedSamples0.vcf ../results/MergedSamples2.vcf > ../results/MergedSamplesEdited.vcf
sed -n 1,20p MergedSamplesEdited.vcf
gawk -i inplace '!a[$2]++' ../results/MergedSamplesEdited.vcf
bgzip ../results/MergedSamplesEdited.vcf
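
For readers who want to follow the cleaning step without the shell tools, here is a rough Python sketch of the same filtering logic: keep header lines, keep records whose ID column contains "rs", drop records whose ID contains ";", and keep only the first record seen for each position. The function name `clean_vcf_lines` and the sample records are made up for illustration; this is not part of snps, lineage, or bcftools.

```python
def clean_vcf_lines(lines):
    """Filter VCF lines roughly the way the awk/gawk steps above do."""
    seen_positions = set()
    kept = []
    for line in lines:
        # Header lines ("##..." and "#CHROM...") are passed through untouched.
        if line.startswith("#"):
            kept.append(line)
            continue
        fields = line.rstrip("\n").split("\t")
        pos, snp_id = fields[1], fields[2]
        # Keep only records with a single rs-style ID.
        if "rs" not in snp_id or ";" in snp_id:
            continue
        # Keep the first record per position (column 2), like '!a[$2]++'.
        if pos in seen_positions:
            continue
        seen_positions.add(pos)
        kept.append(line)
    return kept


# Made-up sample records, for illustration only.
sample = [
    "##fileformat=VCFv4.2\n",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tS1\n",
    "1\t752721\trs3131972\tA\tG\t.\t.\t.\tGT\t0/1\n",
    "1\t752721\trs3131972\tA\tG\t.\t.\t.\tGT\t0/1\n",
    "1\t54676\th3a_37_1_54676_C_T\tC\tT\t.\t.\t.\tGT\t0/0\n",
    "1\t101\trs1;rs2\tA\tG\t.\t.\t.\tGT\t0/1\n",
]
cleaned = clean_vcf_lines(sample)
```

One difference from the shell version: the sketch never de-duplicates header lines, whereas the gawk step is applied to every line of the file. Like the original, it keys on position alone; keying on (chromosome, position) would avoid dropping records that share a position on different chromosomes.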


apriha commented 3 years ago

Thanks LaKisha. snps / lineage can't parse your converted file because it selects the AncestryDNA parser based on the header comments, and that parser expects whitespace between the two alleles (and between their column headers).
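
A minimal sketch of that mismatch, assuming format detection keys off the header comments as described above. The names `detect_source` and `check_columns` are hypothetical and are not snps functions; the sketch also assumes comment lines start with "#", and the column counts come from the error message and file excerpts in this thread.

```python
# Assumed column counts: raw AncestryDNA has rsid, chromosome, position,
# allele1, allele2; 23andMe-style files have rsid, chromosome, position, genotype.
EXPECTED_COLUMNS = {"AncestryDNA": 5, "23andMe": 4}


def detect_source(lines):
    """Guess the source from the leading comment lines (illustrative only)."""
    for line in lines:
        if not line.startswith("#"):
            break
        for source in EXPECTED_COLUMNS:
            if source in line:
                return source
    return None


def check_columns(text):
    """Return (source, expected column count, columns found in first SNP row)."""
    lines = text.splitlines()
    source = detect_source(lines)
    data_rows = [row for row in lines if row.strip() and not row.startswith("#")]
    # data_rows[0] is the column header row; data_rows[1] is the first SNP.
    found = len(data_rows[1].split())
    return source, EXPECTED_COLUMNS.get(source), found


header = "#AncestryDNA raw data download\n"
converted = header + "rsid\tchromosome\tposition\tallele1allele2\nrs369202065\t1\t569388\tGG\n"
raw = header + "rsid\tchromosome\tposition\tallele1\tallele2\nrs369202065\t1\t569388\tG\tG\n"

print(check_columns(converted))  # the header says AncestryDNA, but the row has 4 columns
print(check_columns(raw))        # header and column count agree
```

The converted file still carries the AncestryDNA header comments, so a detector like this picks the five-column layout while the rows only have four fields, which matches the "expected 5 and found 4" error reported earlier.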

But you don't need to convert the file, since snps can already read AncestryDNA files (and the other formats discussed in the README). Give that a try and let me know how it works.

As for the H3Africa file, snps should also be able to read that.

And if you need a VCF file, you can save the SNPs in VCF format.

apriha commented 3 years ago

Closing since there are no updates required for this issue.

apriha commented 3 years ago

Sorry, I closed the issue too early. Upon further investigation, snps should be updated to handle the H3Africa format since the generic parser is not invoked (an rsid is not in the first line). Also, the generic parser wouldn't be able to parse this due to multiple whitespace.
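
To illustrate the whitespace problem described above, here is a small sketch (not snps code) contrasting a strict tab split with a split on any run of whitespace. The helper names are hypothetical, and the run of spaces in the sample row is an assumption about what "multiple whitespace" looks like in the H3Africa files; the row's values are taken from the excerpt earlier in this thread.

```python
def split_strict(line):
    """Split on single tabs, as a strict tab-separated reader would."""
    return line.rstrip("\n").split("\t")


def split_tolerant(line):
    """Split on any run of spaces or tabs."""
    return line.split()


def parse_row(line):
    """Parse a four- or five-column row into (rsid, chrom, pos, genotype)."""
    fields = split_tolerant(line)
    if len(fields) == 5:
        # Five columns: the two alleles are separate fields; join them.
        rsid, chrom, pos, allele1, allele2 = fields
        genotype = allele1 + allele2
    elif len(fields) == 4:
        rsid, chrom, pos, genotype = fields
    else:
        raise ValueError(f"expected 4 or 5 columns, found {len(fields)}")
    return rsid, chrom, int(pos), genotype


row = "h3a_37_1_54676_C_T   1   54676   AA"

print(row.startswith("rs"))    # False: the first line does not begin with an rsid
print(len(split_strict(row)))  # 1: no tabs, so the whole line is one field
print(parse_row(row))          # the tolerant split recovers all four fields
```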

So to handle this, snps could do either (or both) of the following:

  • check if "h3a" is in the first line and apply a parser similar to the AncestryDNA parser with multiple whitespace
  • apply a generic parser as a last check that tries to read four or five column files with multiple whitespace

lakishadavid commented 3 years ago

Hi Andrew, I tried again with fresh AncestryDNA zip files. I'm still getting the same error message.

s = SNPs("/home/ubuntu/myprojectdir/lineage/inputs/Person1.zip")
s.source
'AncestryDNA'
s.build
37
s.assembly
'GRCh37'
s.count
Traceback (most recent call last):
  File "", line 1, in
AttributeError: 'SNPs' object has no attribute 'count'
user662 = l.create_individual('User662', '/home/ubuntu/myprojectdir/lineage/inputs/Person1.zip')
Traceback (most recent call last):
  File "", line 1, in
  File "/home/ubuntu/myprojectdir/venv/lib/python3.8/site-packages/lineage/__init__.py", line 96, in create_individual
    return Individual(name, raw_data, self._output_dir, **kwargs)
AttributeError: 'str' object has no attribute '_output_dir'

apriha commented 3 years ago

Hi @lakishadavid, please create a new virtual environment and install lineage again; I've updated it to support the latest version of snps. FYI, here are some additional installation directions: https://lineage.readthedocs.io/en/latest/installation.html